An alternative to running a prompt agent over a dataset.
Objective
To validate the data type of each column (e.g. binary labels, ordinal labels) so that the correct feature transformations can be applied.
Problem
Although BigQuery AI can be applied to a data column for efficiency, its output is not consistent. The same problem occurs when running language models equipped with tools or function-decorator wrappers: many catalysts, but not enough time.
This stage is important because the correct data type transformations must be applied for the later stages to proceed without unwanted surprises.
Solution
Use sentence-transformers with sample data representing each of the data types:
continuous
binary labels
ordinal labels
multi-label categories
short text e.g. names, emails
long text e.g. chat logs
Assuming no dependency or hierarchical relations hold, the formula matches the canonical cosine similarity form. Let $\hat{t}$ be the target that maximizes cosine similarity over a normalized set $q_i$:
$$\hat{t} = \arg\max_{t} \frac{\langle t, q_i \rangle}{\lVert t \rVert \, \lVert q_i \rVert}$$
In short, the pseudocode steps are:
First, compute $\bar{q}$ as the normalized set.
Then, pick $\hat{t}$ as the target with the highest cosine similarity to $\bar{q}$.
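The two steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the project's implementation: the "normalized set" is read here as the mean of the L2-normalized query vectors, the embeddings are toy 3-dimensional vectors standing in for what a sentence-transformers model would produce, and the names `l2_normalize`, `mean_of_normalized`, `cosine_similarity` and `pick_target` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_of_normalized(vectors: Sequence[Vector]) -> List[float]:
    """Step 1 (assumed reading): average the unit-length query vectors."""
    if not vectors:
        raise ValueError("at least one query vector is required")
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    return [sum(u[k] for u in unit) / len(unit) for k in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """<a, b> / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pick_target(targets: Dict[str, Vector], queries: Sequence[Vector]) -> str:
    """Step 2: return the target name with the highest cosine similarity."""
    q_bar = mean_of_normalized(queries)
    return max(targets, key=lambda name: cosine_similarity(targets[name], q_bar))


# Toy embeddings (hypothetical values, one prototype per data type).
targets = {
    "binary labels": [1.0, 0.0, 0.0],
    "continuous": [0.0, 1.0, 0.0],
    "long text": [0.0, 0.0, 1.0],
}
# Toy embeddings for values drawn from the column under test.
queries = [[0.9, 0.1, 0.0], [2.0, 0.2, 0.1]]

print(pick_target(targets, queries))
```

Normalizing each query before averaging keeps a single long vector from dominating the mean; if the intended reading of the normalized set differs, only `mean_of_normalized` would need to change.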
The data samples representing each data column label are stored under the directory path core/data/sample/*.csv, along with other assumptions that are used to configure the agent.
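As an illustration of how such sample files might be read, the sketch below loads every CSV in a directory into a mapping from label to sample values. The file layout is an assumption, since it is not specified here: one file per data type, the file stem used as the label, a header row, and the sample values in the first column. The helper name `load_samples` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_samples(sample_dir: str) -> Dict[str, List[str]]:
    """Map each CSV file stem (assumed to be the data type label) to its values.

    Assumed layout: a header row followed by one sample value per row
    in the first column. Blank rows are skipped.
    """
    samples: Dict[str, List[str]] = {}
    for path in sorted(Path(sample_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # rows[0] is treated as the header and dropped.
        samples[path.stem] = [
            row[0] for row in rows[1:] if row and row[0].strip()
        ]
    return samples
```

Calling `load_samples("core/data/sample")` would then return one list of sample strings per label, ready to be embedded.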