# Data Cleaning

## Objective

*<mark style="color:$success;">**Robust data-cleaning stages that adapt to any dataset, while allowing reasoning AI agents to steer themselves, balancing playful, curious exploration (explore) with more traditional, goal-driven strategies (greedy/exploit).**</mark>*
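
The source does not say how the agent balances exploration against exploitation. As a rough illustration only, the sketch below uses epsilon-greedy selection, one common way to make that trade-off. The strategy names, the `scores` mapping, and the `choose_strategy` helper are all hypothetical.

```python
import random


def choose_strategy(scores, epsilon=0.2, rng=random):
    """Pick a strategy name: explore with probability epsilon, else exploit.

    `scores` maps a hypothetical strategy name to its current estimated value.
    """
    if rng.random() < epsilon:
        # Explore: pick any strategy uniformly at random.
        return rng.choice(sorted(scores))
    # Exploit (greedy): pick the strategy with the best estimate so far.
    return max(scores, key=scores.get)


# Assumed example scores; in practice these would come from the agent's feedback.
scores = {"drop_rows": 0.4, "impute_median": 0.7, "impute_mode": 0.5}
print(choose_strategy(scores, epsilon=0.0))  # epsilon=0 always exploits
```

A larger `epsilon` makes the agent more exploratory, and `epsilon=0` reduces it to a purely greedy policy.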

## Environment Scenario

The scenario is further divided based on whether the user has defined an ML/AI modelling objective.

User Inputs:

| User Input | Description (and data type) | Required? (Yes = the user must provide it, No = optional) |
| ---------- | --------------------------- | --------------------------------------------------------- |
| Uncleaned data sample. Because the data is uncleaned, fields are expected to be of `object` or `string` data types. | A `.json`, `.csv`, or `.parquet` file, or an IP subnet address used to connect to a database. | Yes |
| Data descriptor chips: short or long text that describes the dataset, e.g. 'finance' or 'This dataset originates from finance sale logs'. | `string` | Yes |
| Modelling objective, e.g. 'To model house predictions for the next financial quarter.' | `string` | No |
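
The three inputs in the table above could be captured in a small container. This is a minimal sketch, not the project's actual interface: the `CleaningRequest` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CleaningRequest:
    """Hypothetical container for the three user inputs."""

    # Path to a .json/.csv/.parquet file, or a database address (required).
    data_source: str
    # Short chip or longer free-text description of the dataset (required).
    data_description: str
    # Optional modelling objective; None means the user has not defined one.
    modelling_objective: Optional[str] = None

    def __post_init__(self) -> None:
        # Both required inputs must be non-empty strings.
        if not self.data_source.strip():
            raise ValueError("data_source is required")
        if not self.data_description.strip():
            raise ValueError("data_description is required")

    @property
    def has_objective(self) -> bool:
        """True when a non-empty modelling objective was supplied."""
        return bool(self.modelling_objective and self.modelling_objective.strip())


request = CleaningRequest(data_source="sales.csv", data_description="finance")
print(request.has_objective)  # False: no modelling objective was given
```

The `has_objective` flag mirrors the split described in this section: it tells downstream stages which of the two scenarios applies.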

## Data Cleaning General Steps

The following tuple lists the data-cleaning stages required for all datasets, regardless of the data type or modelling problem.

```python
(
    "Summarize, document, identify and understand the dataset and data columns.",
    "Identify and process the missing data values in each data field.",
    "Reason about and identify strategies for handling duplicated rows and columns.",
    "Identify and process anomalies.",
    "Transform data columns, for example, encoding binary labels.",
)
```
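
To make the five stages concrete, here is a minimal standard-library sketch that applies each one to rows held as dictionaries. It is an illustration under stated assumptions, not the agent's implementation: the function names, the `MISSING` marker set, and the median-absolute-deviation rule for anomalies are all choices made for this example.

```python
from statistics import median

# Assumed set of strings that should be treated as missing values.
MISSING = {"", "na", "n/a", "null", "none", "nan"}


def summarize(rows):
    """Stage 1: report the row count and the column names seen."""
    return {"n_rows": len(rows), "columns": sorted({k for row in rows for k in row})}


def mark_missing(rows):
    """Stage 2: normalise assumed missing-value markers to None."""
    return [
        {k: None if v is None or str(v).strip().lower() in MISSING else v
         for k, v in row.items()}
        for row in rows
    ]


def drop_duplicate_rows(rows):
    """Stage 3: keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flag_anomalies(rows, column, threshold=3.0):
    """Stage 4: return numeric values more than `threshold` MADs from the median."""
    values = []
    for row in rows:
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue  # skip missing or non-numeric entries
    if not values:
        return []
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []
    return [v for v in values if abs(v - centre) / mad > threshold]


def encode_binary(rows, column, positive):
    """Stage 5: encode a binary label column as 1/0, leaving None untouched."""
    return [
        {**row, column: None if row.get(column) is None else int(row[column] == positive)}
        for row in rows
    ]


raw = [
    {"amount": "10", "paid": "yes"},
    {"amount": "10", "paid": "yes"},   # duplicate row
    {"amount": "12", "paid": "no"},
    {"amount": "11", "paid": "N/A"},   # missing label
    {"amount": "9", "paid": "no"},
    {"amount": "500", "paid": "yes"},  # suspicious amount
]

print(summarize(raw))
cleaned = encode_binary(drop_duplicate_rows(mark_missing(raw)), "paid", positive="yes")
print(flag_anomalies(cleaned, "amount"))  # [500.0]
```

Each stage is a pure function that returns new rows, so an agent could reorder, skip, or swap stages without side effects on the original data.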

