
Stage V: The End!

This stage contains the database management and housekeeping methods that run after the process has finished.

Running Workflows

A pre-configured number of samples, N, is drawn as subsets of the mean-pooled tabular datasets. These subsets contain only the raw values (column names are excluded). Each subset is then projected into a lower-dimensional latent space, aligned to the hidden block size of the sentence-transformer module.
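The sampling step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `draw_value_subset`, the list-of-dicts table layout, and the `seed` parameter are all assumptions made for the example, and the mean pooling and sentence-transformer projection are not covered.

```python
import random


def draw_value_subset(records, n, seed=None):
    """Draw n records at random and keep only their raw values.

    `records` is assumed to be a list of dicts mapping column name -> value
    (a hypothetical layout chosen for this sketch).
    """
    if n > len(records):
        raise ValueError("sample size N exceeds the number of available rows")
    rng = random.Random(seed)
    sampled = rng.sample(records, n)  # sampling without replacement
    # Column names (the dict keys) are dropped; only cell values are kept.
    return [list(record.values()) for record in sampled]


# Hypothetical table used purely for illustration.
table = [
    {"id": 1, "city": "Oslo", "temperature": 3.5},
    {"id": 2, "city": "Lima", "temperature": 19.0},
    {"id": 3, "city": "Cairo", "temperature": 27.2},
    {"id": 4, "city": "Perth", "temperature": 22.4},
]

subset = draw_value_subset(table, n=3, seed=0)
```

Fixing the seed makes the draw reproducible, which may be useful when the same subsets need to be regenerated for an end report.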

From there, the transformed data is passed into scikit-learn or BigQuery PCA to further reduce its dimensionality, so that each row has fewer than 384 dimensions (the default embedding size of the model in use). PCA applies a numerical method, singular value decomposition (SVD), to retain the components with the highest explained variance.
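To make the SVD step concrete, here is a minimal NumPy sketch of PCA computed through SVD. It is an illustration under assumptions rather than the pipeline's implementation: the function name `pca_svd`, the random stand-in data, and the choice of 50 components are hypothetical. In practice, scikit-learn's `sklearn.decomposition.PCA` performs a comparable SVD-based computation.

```python
import numpy as np


def pca_svd(X, n_components):
    """Reduce X (n_samples x n_features) to n_components columns via SVD."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centred data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Variance explained by each component, as a fraction of the total.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    explained_ratio = explained_variance / explained_variance.sum()
    # Keep the leading components and project the data onto them.
    components = Vt[:n_components]
    reduced = X_centered @ components.T
    return reduced, explained_ratio[:n_components]


# Random stand-in for 384-dimensional embeddings (illustrative data only).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

reduced, ratio = pca_svd(embeddings, n_components=50)
print(reduced.shape)  # (200, 50)
```

Because the singular values come back in descending order, keeping the first rows of `Vt` retains the directions with the highest explained variance.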

Database Schema

  • End reports for each previous stage in this pipeline.

User Interaction - Frontend

At the end of this workflow, the following options are presented to the user:

  1. Download cleaning report (lowest priority: this requires connecting to the subnet IP address of the user's database, which is not available for some datastores and can be more difficult to implement)

  2. Proceed to next stage:

    1. Generate current database quality scores, OR

    2. Data insights & significance
