
Data Cleaning

This page documents the procedures for cleaning and transforming datasets, and details how I enabled Gaby to self-orchestrate his decision-making processes.

Objective

Build robust data-cleaning stages that adapt to any dataset, while allowing reasoning AI agents to steer themselves by balancing playful, curious exploration (explore) against more traditional, goal-driven strategies (greedy/exploit).
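One way to sketch this explore/exploit balance is an epsilon-greedy selector. This is a hypothetical sketch, not Gaby's actual mechanism: the strategy names, the score dictionary, and the epsilon value are all illustrative assumptions.

```python
import random

def choose_strategy(strategies, scores, epsilon=0.2):
    """Pick a cleaning strategy: with probability epsilon, explore a
    random strategy; otherwise exploit the highest-scoring one so far."""
    if random.random() < epsilon:
        # Explore: playful, curious choice among all candidates.
        return random.choice(strategies)
    # Exploit (greedy): pick the strategy with the best observed score.
    return max(strategies, key=lambda s: scores.get(s, 0.0))
```

With `epsilon=0.0` the selector is purely greedy; with `epsilon=1.0` it explores every time.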

Environment Scenario

The scenario is further divided based on whether the user has defined an ML/AI modelling objective.

User Inputs:

| User Input | Description (and data type) | Required? (Yes = the user must provide it) |
| --- | --- | --- |
| Uncleaned data sample, so fields will be of object or string data types | `.json`, `.csv`, `.parquet` | |
| IP subnet address to connect to the database | | Yes |
| Data descriptor chips: short/long text describing the dataset, e.g. 'finance' or 'This dataset originated from finance sales logs' | string | Yes |
| Modelling objective, e.g. to model house price predictions for the next financial quarter | string | No |
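A minimal sketch of how these inputs might be collected and validated, assuming a Python implementation; the class and field names here are hypothetical, not part of the actual system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInputs:
    data_path: str          # uncleaned sample: .json, .csv, or .parquet
    db_subnet: str          # IP subnet address of the database (required)
    descriptor_chips: str   # e.g. 'finance' or a longer description (required)
    modelling_objective: Optional[str] = None  # optional ML/AI objective

    def validate(self) -> None:
        # Accept only the supported sample file formats.
        if not self.data_path.endswith((".json", ".csv", ".parquet")):
            raise ValueError("data sample must be .json, .csv, or .parquet")
        # The subnet and descriptor chips are required inputs.
        if not self.db_subnet or not self.descriptor_chips:
            raise ValueError("db_subnet and descriptor_chips are required")
```

The modelling objective is the only optional field, matching the table above.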

Data Cleaning General Steps

The following table outlines the data cleaning stages required for all datasets regardless of the type of data or modelling problem.

(
    "Summarize, document, identify and understand the dataset and its data columns.",
    "Identify and process the missing data values in each data field.",
    "Reason about and identify strategies for handling duplicated data by rows & columns.",
    "Identify and process anomalies.",
    "Transform data columns, for example, encoding binary labels."
)
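The stages above could be sketched in pandas roughly as follows. The function names and the specific tactics (dropping all-empty rows, percentile clipping, 0/1 label encoding) are illustrative assumptions, not the system's actual implementation.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    # Stage 1: document and understand the dataset and its columns.
    return {"shape": df.shape,
            "dtypes": df.dtypes.astype(str).to_dict(),
            "missing": df.isna().sum().to_dict()}

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 2: process missing values per field (here: drop all-empty rows).
    return df.dropna(how="all")

def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 3: handle duplicated data by rows and columns.
    return df.loc[~df.duplicated(), ~df.columns.duplicated()]

def clip_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 4: cap numeric outliers at the 1st/99th percentiles.
    for col in df.select_dtypes("number"):
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)
    return df

def encode_binary(df: pd.DataFrame, col: str, positive) -> pd.DataFrame:
    # Stage 5: transform columns, e.g. encode a binary label as 0/1.
    df[col] = (df[col] == positive).astype(int)
    return df
```

Each stage takes and returns a DataFrame, so they can be chained in order for any dataset.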
