# BigQuery AI - Building the Future of Data

## Hackathon

### Objective / Motivation

In this competition, my objective was to make the way I design and program projects more robust.

**Submission Objective**

*<mark style="color:$success;">**Robust data-cleaning stages that adapt to any dataset, while allowing reasoning AI agents to steer themselves, balancing playful, curious exploration (explore) with more traditional, goal-driven strategies (greedy/exploit).**</mark>*
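One common way to express the explore/exploit balance is an epsilon-greedy rule. The sketch below is only an illustration of that idea, not Gaby's actual implementation: the strategy names, the reward values, and the functions `choose_strategy` and `update_reward` are all hypothetical.

```python
import random


def choose_strategy(avg_reward, epsilon, rng):
    """Return a strategy name from avg_reward (hypothetical cleaning strategies).

    With probability epsilon a strategy is picked at random (explore);
    otherwise the one with the highest average reward so far is picked
    (greedy/exploit).
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(avg_reward))  # explore: uniform random pick
    return max(avg_reward, key=avg_reward.get)  # exploit: best known so far


def update_reward(avg_reward, counts, name, reward):
    """Fold a newly observed reward into the running mean for one strategy."""
    counts[name] += 1
    avg_reward[name] += (reward - avg_reward[name]) / counts[name]


# Illustrative values only; in practice the reward might be a data-quality score.
scores = {"impute_median": 0.2, "drop_rows": 0.7, "ask_llm": 0.5}
picked = choose_strategy(scores, epsilon=0.1, rng=random.Random(0))
```

A small `epsilon` keeps the agent mostly goal-driven while still leaving room for occasional curious exploration.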

### Brief

{% embed url="<https://www.kaggle.com/competitions/bigquery-ai-hackathon/overview>" %}
Hackathon Link
{% endembed %}

The competition requires a POC that uses Google BigQuery and VectorAI to manage datasets. Gaby wasn't built for this competition, but I thought entering was a good idea since his objective aligned with most of the 'inspiration' projects listed in the competition. The main challenges are:

* Adapting Gaby's current data processors to Google BigQuery.
* Gaby's client interaction interface has not been configured yet, so `livekit` is used to help serve the dashboard platform and handle user interactions (Text-To-Speech (TTS), etc.).

The POC demo showcases how Gaby can use the current tech stack and architecture to self-orchestrate the cleaning of a dataset. The dataset used: <https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training>

The dataset was selected to reflect real-life problems, where data is never as clean as a typical public dataset. Other datasets considered included text datasets whose classification labels are biased by the labeler.
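To make "dirty" concrete, here is a minimal sketch of one cleaning step in plain Python. It rests on assumptions not stated above: that the dataset marks bad cells with placeholders such as `ERROR`, `UNKNOWN`, or an empty string, and that it has `Quantity`, `Price Per Unit`, and `Total Spent` columns. The helpers `to_number` and `clean_row` are hypothetical and are not part of Gaby.

```python
# Assumed placeholder markers for bad cells; adjust to the real dataset.
PLACEHOLDERS = {"", "ERROR", "UNKNOWN"}


def to_number(raw):
    """Convert a raw cell to float, or None if it is missing or a placeholder."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.upper() in PLACEHOLDERS:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def clean_row(row):
    """Parse the three numeric columns and recover one missing value if possible.

    Relies on the assumed relation: Total Spent = Quantity * Price Per Unit.
    """
    qty = to_number(row.get("Quantity"))
    price = to_number(row.get("Price Per Unit"))
    total = to_number(row.get("Total Spent"))
    if total is None and qty is not None and price is not None:
        total = qty * price
    elif qty is None and total is not None and price:
        qty = total / price
    elif price is None and total is not None and qty:
        price = total / qty
    return {**row, "Quantity": qty, "Price Per Unit": price, "Total Spent": total}


cleaned = clean_row({"Item": "Coffee", "Quantity": "2",
                     "Price Per Unit": "3.0", "Total Spent": "ERROR"})
```

Cells that cannot be recovered stay as `None`, so a later stage can decide whether to impute or drop them.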

### Requirement Checklist

<table data-full-width="true"><thead><tr><th>Requirement</th><th align="center">Outcome / Link</th><th data-type="checkbox">Requirement met? (Ticked = Yes)</th></tr></thead><tbody><tr><td>Use at least 3 BigQuery AI/ML methods in the competition.</td><td align="center"><a data-mention href="/pages/zl9Y4JweKgPiyFUrUDaH">/pages/zl9Y4JweKgPiyFUrUDaH</a></td><td>true</td></tr><tr><td>Upload a Jupyter Notebook that runs when played.</td><td align="center">My Google Cloud and app development account was banned a day before submission.</td><td>false</td></tr><tr><td>Link a blog / app demo</td><td align="center">Linked this blog, but without a video showing how to use it.</td><td>true</td></tr></tbody></table>

### Submission (Final Verdict)

I spent a week refactoring the project to meet the requirements. Unfortunately, I did not set up my Google Cloud account correctly, which may have triggered the ban. The ban also affected my frontend deployment, which was being served through Firebase Studio.

### Resources Used

<details>

<summary>Template Resource</summary>

<https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/kaggle/bq_dataframes_ai_forecast.ipynb>

</details>

<details>

<summary>Github / Notebooks / Python Libraries</summary>

<https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb>

</details>

## Revision & Self-Feedback

**Setbacks & Improvements**

* Setting up any tech stack needs more time than I allowed for.
* I did not spend enough time planning.
* The problem needed to be broken into smaller pieces: the sub-objectives I planned for myself were still too hard to test in the given timeframe. I realised that when building prompt chains, the important test cases are the ones that also apply to a larger sample or to similar, if not identical, situations.

**The Good**

* Documented every step.

**Changes along the way**

The initial framework of database lifecycle stages was:

* Data Profiling & Documentation - with BigQuery AI
* Data Cleaning, Engineering & Transformation - with BigQuery AI
* Data Analytics & Chart Visualisations and Business Insights - with BigQuery AI for persisting data
* Database Project Manager - MCP / Dockers
* Data Quality Testing & Assurance - with BigQuery AI
* Data Science Team's Trajectory Path Paver - MCP / Dockers

These objectives were still too broad to implement in the timeframe, so I revised the database lifecycle and settled on the following list for the project:

<figure><img src="/files/wm0gbwNLeV11oAEp3HyW" alt="" width="188"><figcaption></figcaption></figure>

The most challenging stage is data wrangling, and addressing unclean datasets should be Gaby AI’s primary focus. From my current build, I’ve realized that once the data wrangling skeleton pipeline is firmly established, implementing the remaining lifecycle stages becomes significantly easier. Any unforeseen issues that arise are more likely to stem from the way external application integrations are constructed.


