Stage V: The End!

This stage contains the database management and housekeeping methods that run once the rest of the pipeline has finished.

Running Workflows

Subsets of a pre-configured sample size, N, are drawn from the mean-pooled tabular datasets. These subsets contain only the raw values (column names are excluded). Each subset is then projected into a latent space of reduced dimensionality, aligned to the hidden block size of the sentence-transformer module.
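The sampling step can be illustrated with a minimal sketch. It assumes rows are held as dictionaries keyed by column name; the function name `draw_value_subset`, the `seed` parameter, and the toy table are hypothetical and only show how N rows might be drawn while discarding column names.

```python
import random


def draw_value_subset(rows, n, seed=0):
    """Draw up to n rows at random and keep only the raw cell values."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled = rng.sample(rows, min(n, len(rows)))
    # Drop the column names: each row becomes a plain list of values.
    return [list(row.values()) for row in sampled]


# Toy table standing in for a real dataset (hypothetical data).
table = [
    {"id": 1, "city": "Oslo", "temp_c": 4.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 31.2},
    {"id": 4, "city": "Kyiv", "temp_c": 8.1},
]

subset = draw_value_subset(table, n=2)
```

Each entry in `subset` is a bare list of values, which is the form that would then be passed to the embedding model.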

From there, the transformed data is passed into scikit-learn or BigQuery PCA to reduce its dimensionality further, so that the number of features falls below 384 (the default embedding size of the model in use). PCA uses a numerical method, singular value decomposition (SVD), to retain the components that explain the most variance.
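As a rough illustration of how PCA uses SVD, the sketch below performs the reduction directly with NumPy. It is not the pipeline's implementation: the random matrix stands in for real 384-dimensional embeddings, and the function name `pca_svd` and the target of 64 components are assumptions chosen for the example. In practice, `sklearn.decomposition.PCA(n_components=...)` covers the same steps.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project X onto its top principal components using SVD."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centred data.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Variance explained by each component, as a fraction of the total.
    explained = (S ** 2) / (X.shape[0] - 1)
    ratio = explained / explained.sum()
    return Xc @ components.T, ratio[:n_components]


# Stand-in for sentence embeddings: 200 rows, 384 dimensions (assumed shape).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

reduced, ratio = pca_svd(embeddings, n_components=64)
```

Here `reduced` has 64 columns instead of 384, and `ratio` lists the share of variance each retained component explains, in descending order.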

Database Schema

End-of-stage reports for each previous stage in this pipeline.

User Interaction - Frontend

At the end of this workflow, the following options are presented to the user:

Download the cleaning report (lowest priority: this requires connecting to the subnet IP address of the user's database, which is not available for some datastores and can be more difficult to implement)
Proceed to the next stage, either:
1. Generate current database quality scores, or
2. Data insights & significance


