Problem 1: Solution

An alternative to running a prompt agent over a dataset.

Objective

To validate the data type of each column (e.g. binary labels, ordinal labels) before applying feature transformations.

Problem

Applying BigQuery AI to a data column is efficient, but its output is not consistent. The same problem occurs when running language models equipped with tools or function decorator wrappers: there are many catalysts but not enough time.
This stage is important because applying the correct data type transformations lets the later stages proceed without unwanted surprises.

Solution

Use sentence-transformers with sample data representing each of the data types:
- continuous
- binary labels
- ordinal labels
- multi-label categories
- short text e.g. names, emails
- long text e.g. chat logs

Assuming no dependency or hierarchical relations hold between the data types, the formula reduces to the canonical cosine similarity form. Let $\hat{t}$ be the target that maximizes cosine similarity with $\bar{q}$, the normalized representation of the set $\{q_i\}$:

$\hat{t} = \arg\max_{t} \; \frac{\langle t, \bar{q} \rangle}{\|t\| \, \|\bar{q}\|}$

In short, the procedure as pseudocode steps:

First, compute $\bar{q}$ from the normalized set $\{q_i\}$.
Then pick $\hat{t}$ as the target with the highest cosine similarity to $\bar{q}$.
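The two steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes $\bar{q}$ is the re-normalized mean of the unit-normalized $q_i$, that each target $t$ is a single reference vector per data type, and it uses toy 2-D vectors in place of real embeddings (in practice these would come from a sentence-transformers model's `encode` method). The names `l2_normalize`, `mean_normalized`, `cosine`, and `pick_target` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_normalized(queries: Sequence[Vector]) -> List[float]:
    """Assumed reading of q_bar: mean of the unit-normalized q_i, re-normalized."""
    if not queries:
        raise ValueError("queries must not be empty")
    unit = [l2_normalize(q) for q in queries]
    dim = len(unit[0])
    mean = [sum(u[d] for u in unit) / len(unit) for d in range(dim)]
    return l2_normalize(mean)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity <a, b> / (|a| |b|); defined as 0.0 if either norm is zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def pick_target(targets: Dict[str, Vector], queries: Sequence[Vector]) -> str:
    """Return the target label whose vector is most similar to q_bar."""
    q_bar = mean_normalized(queries)
    return max(targets, key=lambda name: cosine(targets[name], q_bar))


# Toy stand-ins for embeddings (hypothetical values, not real model output).
TARGETS = {"binary labels": [1.0, 0.0], "continuous": [0.0, 1.0]}
QUERIES = [[2.0, 0.1], [3.0, -0.2], [0.5, 0.1]]

print(pick_target(TARGETS, QUERIES))  # binary labels
```

Normalizing each $q_i$ before averaging keeps a single long vector from dominating $\bar{q}$, which is one reason to prefer this reading over a raw mean.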

The data samples representing each data column label are stored under the directory path core/data/sample/*.csv, along with other assumptions that are used to configure the agent.
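As a rough sketch of how such sample files might be read, the snippet below maps each CSV file to a list of sample strings. The file layout is not specified here, so everything about it is an assumption: one CSV per data type, the file stem used as the label, a header row, and the sample values in the first column. The function name `load_type_samples` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_type_samples(sample_dir: str) -> Dict[str, List[str]]:
    """Map each CSV file stem (assumed to be the data type label) to its samples."""
    samples: Dict[str, List[str]] = {}
    for path in sorted(Path(sample_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row is a header and the first column holds the values.
        samples[path.stem] = [row[0] for row in rows[1:] if row and row[0].strip()]
    return samples
```

Calling `load_type_samples("core/data/sample")` would then return one list of strings per file, ready to be embedded as the reference samples for each data type.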


Last updated 4 months ago
