
Problem 1: Solution

An alternative to running a prompt agent over a dataset.

Objective

To validate the data type of each column (e.g. binary labels, ordinal labels) before applying feature transformations.

Problem

  • Although BigQuery AI can be applied to a data column for efficiency, its output is not consistent. The same problem occurs when running language models equipped with tools or function-decorator wrappers: many catalysts, but not enough time.

  • This stage is important because applying the correct data type transformations lets the later stages proceed without unwanted surprises.

Solution

  • Utilizing sentence-transformers with sample data representing the data types:

    • continuous

    • binary labels

    • ordinal labels

    • multi-label categories

    • short text e.g. names, emails

    • long text e.g. chat logs
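As a rough illustration of the list above, the sample data could be reduced to one prototype embedding per data type. This is a minimal sketch, not the project's implementation: the values in `SAMPLES` are made up, `build_prototypes` and `sentence_transformer_encoder` are hypothetical names, and the model name `all-MiniLM-L6-v2` is an assumption that the source does not specify. The encoder is passed in as a function so the averaging logic can be checked without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]

# Illustrative sample values per data type (made up for this sketch).
SAMPLES: Dict[str, List[str]] = {
    "continuous": ["3.14", "0.002", "1875.5"],
    "binary labels": ["yes", "no", "true", "false"],
    "ordinal labels": ["low", "medium", "high"],
    "multi-label categories": ["red;blue", "sports;news;tech"],
    "short text": ["Jane Doe", "jane@example.com"],
    "long text": ["Hi, I ordered a laptop last week and it still has not arrived."],
}


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_prototypes(samples: Dict[str, List[str]], encode: Encoder) -> Dict[str, Vector]:
    """Embed each type's sample values and average them into one prototype."""
    return {label: mean_vector(encode(values)) for label, values in samples.items()}


def sentence_transformer_encoder(model_name: str = "all-MiniLM-L6-v2") -> Encoder:
    """Return an encoder backed by sentence-transformers (imported lazily)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit-length vectors.
    return lambda texts: model.encode(list(texts), normalize_embeddings=True).tolist()
```

With the library installed, `build_prototypes(SAMPLES, sentence_transformer_encoder())` would return one vector per data type. Averaging the sample embeddings is one possible way to summarize a type; keeping every sample embedding is an equally valid choice.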

Assuming no dependency or hierarchical relations hold, the formula reduces to the canonical cosine similarity form. The target $\hat{t}$ is the one that maximizes cosine similarity over a normalized set $\{q_i\}$:

$$\hat{t} = \arg\max_{t} \; \frac{\langle t, q_i \rangle}{\|t\| \, \|q_i\|}$$

In short, as pseudocode steps:

  1. First compute $\bar{q}$ as the normalized set.

  2. Then pick $\hat{t}$ as the target with the highest cosine similarity to $\bar{q}$.
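The two steps can be sketched in plain Python. This sketch rests on an interpretation the source does not spell out: the $q_i$ are taken to be embeddings of a column's values, $\bar{q}$ is taken to be the mean of the L2-normalized $q_i$, and each target $t$ is one prototype vector per data type. The names `l2_normalize`, `cosine`, and `pick_target` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(vector: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity <a, b> / (||a|| ||b||); 0.0 if either norm is zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def pick_target(column_vectors: Sequence[Sequence[float]],
                targets: Dict[str, Sequence[float]]) -> str:
    """Return the target label whose vector is most similar to q_bar."""
    # Step 1 (assumed reading): q_bar is the mean of the normalized q_i.
    normalized = [l2_normalize(q) for q in column_vectors]
    q_bar = [sum(column) / len(normalized) for column in zip(*normalized)]
    # Step 2: argmax over targets of the cosine similarity to q_bar.
    return max(targets, key=lambda label: cosine(targets[label], q_bar))
```

For example, with toy two-dimensional targets `{"binary labels": [1.0, 0.0], "continuous": [0.0, 1.0]}`, a column whose vectors point mostly along the first axis is assigned `"binary labels"`.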

The data samples representing each data column label are stored under the directory path core/data/sample/*.csv, along with other assumptions that are used to configure the agent.
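Loading those samples might look like the sketch below. The file layout is an assumption, since the source only gives the glob pattern: each CSV is assumed to hold the samples for one label, the file stem is used as the label name, and the first row is assumed to be a header. `load_samples` is a hypothetical helper.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_samples(sample_dir: str = "core/data/sample") -> Dict[str, List[str]]:
    """Map each CSV's file stem to the list of its non-empty cell values."""
    samples: Dict[str, List[str]] = {}
    for path in sorted(Path(sample_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = csv.reader(handle)
            next(rows, None)  # skip the header row (assumed to be present)
            samples[path.stem] = [
                cell for row in rows for cell in row if cell.strip()
            ]
    return samples
```

Under these assumptions, a file named `binary_labels.csv` (a made-up name) containing a header and the rows `yes` and `no` would yield `{"binary_labels": ["yes", "no"]}`.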
