
Data Cleaning

This page documents the procedures for cleaning and transforming datasets, and details how I enabled Gaby to self-orchestrate his decision-making processes.

Objective

Build robust data-cleaning stages that adapt to any dataset, while allowing reasoning AI agents to steer themselves by balancing playful, curious exploration (explore) against more traditional, goal-driven strategies (greedy/exploit).
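One way to sketch this explore/exploit balance is an epsilon-greedy selector. This is a hypothetical sketch, not Gaby's actual mechanism: the strategy names, the score dictionary, and the epsilon value are all illustrative assumptions.

```python
import random

def choose_strategy(strategies, scores, epsilon=0.2):
    """Pick a cleaning strategy: with probability epsilon, explore a
    random strategy; otherwise exploit the highest-scoring one so far."""
    if random.random() < epsilon:
        # Explore: playful, curious choice among all candidates.
        return random.choice(strategies)
    # Exploit (greedy): pick the strategy with the best observed score.
    return max(strategies, key=lambda s: scores.get(s, 0.0))
```

With `epsilon=0.0` the selector is purely greedy; with `epsilon=1.0` it explores every time.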

Environment Scenario

The scenario is further divided based on whether the user has defined an ML/AI modelling objective.

User Inputs:

| User Input | Description (and data type) | Required? (Yes = the user must provide it) |
| --- | --- | --- |
| Uncleaned data sample, so fields will be of object or string data types | `.json`, `.csv`, `.parquet` | |
| IP subnet address to connect to the database | | Yes |
| Data descriptor chips: short/long text describing the dataset, e.g. 'finance' or 'This dataset originated from finance sales logs' | string | Yes |
| Modelling objective, e.g. to model house price predictions for the next financial quarter | string | No |
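A minimal sketch of how these inputs might be collected and validated, assuming a Python implementation; the class and field names here are hypothetical, not part of the actual system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInputs:
    data_path: str          # uncleaned sample: .json, .csv, or .parquet
    db_subnet: str          # IP subnet address of the database (required)
    descriptor_chips: str   # e.g. 'finance' or a longer description (required)
    modelling_objective: Optional[str] = None  # optional ML/AI objective

    def validate(self) -> None:
        # Accept only the supported sample file formats.
        if not self.data_path.endswith((".json", ".csv", ".parquet")):
            raise ValueError("data sample must be .json, .csv, or .parquet")
        # The subnet and descriptor chips are required inputs.
        if not self.db_subnet or not self.descriptor_chips:
            raise ValueError("db_subnet and descriptor_chips are required")
```

The modelling objective is the only optional field, matching the table above.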

Data Cleaning General Steps

The following table outlines the data cleaning stages required for all datasets regardless of the type of data or modelling problem.

(
    "Summarize, document, identify and understand the dataset and its data columns.",
    "Identify and process the missing data values in each data field.",
    "Reason about and identify strategies for handling duplicated data by rows & columns.",
    "Identify and process anomalies.",
    "Transform data columns, for example, encoding binary labels."
)
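The stages above could be sketched in pandas roughly as follows. The function names and the specific tactics (dropping all-empty rows, percentile clipping, 0/1 label encoding) are illustrative assumptions, not the system's actual implementation.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    # Stage 1: document and understand the dataset and its columns.
    return {"shape": df.shape,
            "dtypes": df.dtypes.astype(str).to_dict(),
            "missing": df.isna().sum().to_dict()}

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 2: process missing values per field (here: drop all-empty rows).
    return df.dropna(how="all")

def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 3: handle duplicated data by rows and columns.
    return df.loc[~df.duplicated(), ~df.columns.duplicated()]

def clip_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    # Stage 4: cap numeric outliers at the 1st/99th percentiles.
    for col in df.select_dtypes("number"):
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)
    return df

def encode_binary(df: pd.DataFrame, col: str, positive) -> pd.DataFrame:
    # Stage 5: transform columns, e.g. encode a binary label as 0/1.
    df[col] = (df[col] == positive).astype(int)
    return df
```

Each stage takes and returns a DataFrame, so they can be chained in order for any dataset.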
