# Stage II: Missing Dataset

In real life, missing data is common. Although addressing it should ideally receive ample time and attention, scientists and analysts often face tight deadlines and, consequently, may fall down a rabbit hole or fall short of team and personal goals, especially in the fast-paced environment of a startup.

> Definition: Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. #1

There is no single correct method for handling missing data: the right choice depends on the data sample, the modelling problem, the technique used to reach the end objective, and how unclean the data is. This doesn't make the problem *infeasible*, though. The main challenge in automating this workflow is choosing the optimal method for diagnosing the missing values given these many layers of context.

Using Bayes' probability, the problem of deciding on the optimal method $$Method\_i$$, where $$i = 1, 2, 3, ...$$ indexes the $$i^{th}$$ method being explored in the current session, can be reduced to the following expression:

$$
P(Method\_i | x\_{\text{model problem}} \wedge x\_{\text{data domain description} } \wedge x\_{\text{data field}})
$$

where:

* $$x\_{\text{model problem}}$$ : the target objective to achieve from this data sample
* $$x\_{\text{data domain description}}$$ : the type or origin of this dataset
* $$x\_{\text{data field}}$$ : the data column containing the missing values
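
As a sketch of how this posterior could be evaluated, write $$x$$ for the conjunction of the three conditioning variables above. Bayes' rule then expands the expression as shown below. The prior $$P(Method\_i)$$ and the likelihood $$P(x | Method\_i)$$ are assumptions here; how they would be estimated is left open.

$$
P(Method\_i | x) = \frac{P(x | Method\_i) \, P(Method\_i)}{\sum\_{j} P(x | Method\_j) \, P(Method\_j)}
$$

Under this reading, the method selected in a session would be the one with the highest posterior, $$\arg\max\_i P(Method\_i | x)$$.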

From a programmer's point of view, the data persisted across states in this decision-making cycle are insights that describe the currently observed sample. These insights are also persisted across the agent's episodes and can be stored in BigQuery Datastore for easy access during data transformations.

| Agent                      | Purpose                         | Output Format                                            | Determinism Tactics                      |
| -------------------------- | ------------------------------- | -------------------------------------------------------- | ---------------------------------------- |
| MissingMechanismClassifier | LLM classification of mechanism | JSON {mechanism, justification, risk\_flags, confidence} | Constrained prompt + fallback parser     |
| MissingImpactAssessor      | Risk scoring across 4 domains   | JSON                                                     | Defaults on parse failure                |
| MissingStrategyRecommender | Ranked handling strategies      | JSON array + rationale                                   | Conservative fallback (mean\_imputation) |
| (Heuristic Layer)          | Pre-LLM signal extraction       | Python dataclass                                         | Deterministic numeric signals            |
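
To make the table more concrete, here is a minimal sketch of two of its rows: a deterministic heuristic layer that emits a Python dataclass, and a fallback parser for the `MissingMechanismClassifier` reply. The names `MissingSignals`, `extract_signals` and `parse_classifier_reply`, the `"UNKNOWN"` fallback mechanism and the default values are all hypothetical; only the JSON keys come from the table.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MissingSignals:
    """Hypothetical container for deterministic, pre-LLM numeric signals."""
    column: str
    n_rows: int
    n_missing: int

    @property
    def missing_ratio(self) -> float:
        # Guard against empty columns to avoid a division by zero.
        return self.n_missing / self.n_rows if self.n_rows else 0.0


def extract_signals(column: str, values: list) -> MissingSignals:
    """Count missing entries (here: None only) in a single column."""
    n_missing = sum(1 for v in values if v is None)
    return MissingSignals(column=column, n_rows=len(values), n_missing=n_missing)


ALLOWED_MECHANISMS = {"MCAR", "MAR", "MNAR"}


def _fallback() -> dict:
    # Assumed conservative default, returned when the reply is unusable.
    return {"mechanism": "UNKNOWN", "justification": "",
            "risk_flags": [], "confidence": 0.0}


def parse_classifier_reply(reply: str) -> dict:
    """Parse the classifier's JSON reply, falling back to defaults on failure."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return _fallback()
    if not isinstance(data, dict):
        return _fallback()
    mechanism = data.get("mechanism")
    if not isinstance(mechanism, str) or mechanism not in ALLOWED_MECHANISMS:
        return _fallback()
    # Fill any keys the model omitted with the default values.
    return {**_fallback(), **data}
```

Keeping the signal extraction free of LLM calls means the same input always yields the same numbers, so only the classification step needs a fallback path.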

## Types of Missing Dataset

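According to the [source](https://pmc.ncbi.nlm.nih.gov/articles/PMC3668100/#sec1), the first stage is to distinguish how the missing values occur in the data sample. Missing values can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The table below lists, per data field type, tests that can help tell each pair of mechanisms apart: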

| Data Field Type                 | MAR vs MCAR                                                                                                           | MNAR vs MAR                                                                                                    | MNAR vs MCAR                                                                                                        |
| ------------------------------- | --------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Numeric (e.g., age, salary)     | Little’s MCAR test; correlation tests between missing indicator & observed vars; logistic regression on missingness   | Pattern-mixture models; Heckman selection models; sensitivity analysis with simulated unobserved values        | Compare Little’s test with value-dependent dropout models; fit selection models predicting missing from true values |
| Binary classification           | Chi-square test between missingness indicator & class; logistic regression of missingness \~ class                    | Logistic regression with observed covariates + EM imputation; examine residual dependence                      | Randomization/permutation tests vs fitted dropout models; check if imbalance remains after controlling for class    |
| Multilabel classification       | Missingness indicator regressed on known labels; random forests for feature importance                                | Pattern-mixture analysis across label sets; semi-supervised EM algorithms                                      | Compare observed-label-predictive models with latent class models predicting missingness                            |
| Continuous (sensor data)        | Time-series autocorrelation of missingness indicator; runs tests; logistic regression on device status                | Joint modeling of outcome + missingness process (shared parameter models); survival analysis for dropout       | Monte Carlo dropout simulation to test if missing aligns with extreme tails; likelihood ratio tests                 |
| Long text (speech, transcripts) | Logistic regression: missingness \~ metadata (speaker, duration, file size); clustering missing vs non-missing groups | Latent variable models on semantic embeddings; adversarial ML (train classifier to distinguish missing vs not) | Compare uniform random corruption vs topic-dependent dropout (e.g. TF-IDF clusters more missing)                    |
| Other text (name, email)        | Chi-square test missingness \~ demographic groups; multiple imputation then compare fit metrics                       | Sensitivity analysis with reweighted likelihoods; Bayesian models with informative priors on unobserved values | Compare uniform random missing (MCAR) with models linking rare/unusual values to missingness (MNAR)                 |
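
As one illustration of the table, the chi-square test between a missingness indicator and a binary class (the "Binary classification" row, MAR vs MCAR) can be sketched with the standard library alone. The function name and inputs are hypothetical, and the test is the plain Pearson chi-square without a continuity correction; with one degree of freedom the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math


def chi_square_missing_vs_class(missing, labels):
    """Pearson chi-square test on the 2x2 table of missingness vs. binary class.

    missing: iterable of bools, True where the feature value is missing.
    labels:  iterable of 0/1 class labels for the same rows.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    missing, labels = list(missing), list(labels)
    if len(missing) != len(labels):
        raise ValueError("missing and labels must have the same length")

    # counts[m][c]: rows with missingness m (0/1) and class c (0/1).
    counts = [[0, 0], [0, 0]]
    for m, c in zip(missing, labels):
        counts[int(bool(m))][int(c)] += 1

    n = len(missing)
    row_totals = [sum(row) for row in counts]
    col_totals = [counts[0][j] + counts[1][j] for j in range(2)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("need missing and observed rows, and both classes")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (counts[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 dof.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

A small p-value suggests that missingness depends on the observed class, which argues against MCAR. It cannot by itself distinguish MAR from MNAR, since that would require information about the unobserved values.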

## Decision-Making Tree in Distinguishing Missing Values

```
                          ┌───────────────────────────┐
                          │      Start: Missing?      │
                          └─────────────┬─────────────┘
                                        │
                          ┌─────────────┴─────────────┐
                          │                           │
                 ┌────────▼─────────┐       ┌─────────▼─────────┐
                 │ Test MCAR (Is     │       │   No missing data │
                 │ missingness       │       │ → Done            │
                 │ completely random?)│       └──────────────────┘
                 └─────────┬─────────┘
                           │
         ┌─────────────────┼──────────────────┐
         │                                    │
 ┌───────▼────────┐                 ┌─────────▼─────────┐
 │ MCAR confirmed │                 │ MCAR rejected     │
 │ (Little’s test │                 │ (missingness not  │
 │ not rejected,  │                 │ purely random)    │
 │ no predictors) │                 └─────────┬─────────┘
 └────────────────┘                           │
                                              │
                                 ┌────────────▼──────────────┐
                                 │ Test MAR (Dependence on   │
                                 │ observed variables?)      │
                                 └────────────┬──────────────┘
                                              │
                        ┌─────────────────────┼─────────────────────┐
                        │                                           │
             ┌──────────▼─────────┐                     ┌───────────▼──────────┐
             │ MAR confirmed      │                     │ Residual dependence   │
             │ (missingness can   │                     │ remains after all     │
             │ be explained by    │                     │ observed variables    │
             │ known covariates)  │                     │ tested                │
             └──────────┬─────────┘                     └───────────┬──────────┘
                        │                                           │
             ┌──────────▼─────────┐                     ┌───────────▼───────────┐
             │ Treat as MAR for   │                     │ Suspect MNAR          │
             │ imputation models  │                     │ (missingness depends  │
             │ and ML pipelines   │                     │ on unobserved value   │
             └────────────────────┘                     │ itself; use selection │
                                                        │ models, sensitivity   │
                                                        │ analysis)             │
                                                        └───────────────────────┘
```
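
The branches of the tree above can be sketched as a small function. This is an illustration only: the name `classify_mechanism`, the returned labels and the `alpha = 0.05` threshold are assumptions, and the two inputs stand in for whatever MCAR test (for example Little's test) and residual-dependence check are actually run upstream.

```python
def classify_mechanism(n_missing: int,
                       mcar_p_value: float,
                       residual_dependence: bool,
                       alpha: float = 0.05) -> str:
    """Walk the decision tree and return a mechanism label.

    n_missing:           number of missing values in the column.
    mcar_p_value:        p-value of an MCAR test such as Little's test.
    residual_dependence: True if missingness is still unexplained after
                         all observed variables have been tested.
    alpha:               assumed significance level for rejecting MCAR.
    """
    if n_missing == 0:
        return "NO_MISSING_DATA"      # right-hand branch: done
    if mcar_p_value >= alpha:
        return "MCAR"                 # MCAR not rejected
    if residual_dependence:
        return "MNAR_SUSPECTED"       # use selection models, sensitivity analysis
    return "MAR"                      # explained by known covariates


# Usage: MCAR rejected, and observed covariates explain the missingness.
print(classify_mechanism(n_missing=12, mcar_p_value=0.01, residual_dependence=False))
```

Note that failing to reject MCAR is not proof of MCAR; the label only records which branch of the tree was taken.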

## An Attempt at Quantifying the Classes

However, this classification doesn't consider the context of the observed dataset or of the data column in which the missing values are observed. The following are draft XOR / truth tables based on my understanding of the paper written by Kang. These matrix representations can be used to guide an LLM in decision-making (online learning) processes, or as validation tests / rules.

<details>

<summary>MCAR</summary>

|                    | Missing values | Non-Missing Values |
| ------------------ | -------------- | ------------------ |
| Missing Values     | 0              | 0                  |
| Non-Missing values | 0              | 0                  |

</details>

<details>

<summary>MAR</summary>

|                    | Missing values | Non-Missing Values |
| ------------------ | -------------- | ------------------ |
| Missing Values     | 0              | 1                  |
| Non-Missing values | 0              | 0                  |

</details>

<details>

<summary>MNAR</summary>

|                    | Missing values | Non-Missing Values |
| ------------------ | -------------- | ------------------ |
| Missing Values     | 0.5            | 0.5                |
| Non-Missing values | 0.5            | 0.5                |

</details>
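
One way to use these matrices as validation rules is to store them as reference patterns and compare a candidate matrix against each. The sketch below copies the three tables above verbatim and picks the nearest one by squared distance; the function name and the distance measure are assumptions, and the code makes no claim about how a candidate matrix would be estimated from data.

```python
# Reference patterns copied from the three tables above
# (rows and columns: missing values, non-missing values).
REFERENCE_MATRICES = {
    "MCAR": ((0.0, 0.0), (0.0, 0.0)),
    "MAR": ((0.0, 1.0), (0.0, 0.0)),
    "MNAR": ((0.5, 0.5), (0.5, 0.5)),
}


def squared_distance(a, b) -> float:
    """Sum of squared element-wise differences between two 2x2 matrices."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def closest_mechanism(matrix) -> str:
    """Return the mechanism whose reference matrix is nearest to `matrix`."""
    return min(REFERENCE_MATRICES,
               key=lambda name: squared_distance(matrix, REFERENCE_MATRICES[name]))
```

A rule of this kind could reject an LLM classification when the label it returns disagrees with the nearest reference pattern.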

## Worthy Considerations

In many applications, missing values are not represented in a consistent form. The most common representations in datasets are empty strings, `pd.NA`, `None`, `np.nan`, etc.
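
A minimal sketch of normalizing these representations to a single sentinel (`None`) before any diagnosis is run. The token set in `MISSING_TOKENS` is an assumption and should be adapted per dataset. The check covers `None`, float NaN (which includes `np.nan`) and empty or placeholder strings; `pd.NA` is not handled here, since that would require importing pandas.

```python
import math

# Assumed placeholder strings, compared case-insensitively after stripping.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value) -> bool:
    """Return True if `value` looks like a missing-data marker."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip().lower() in MISSING_TOKENS
    return False


def normalize_column(values):
    """Replace every missing marker in a column with None."""
    return [None if is_missing(v) else v for v in values]
```

Legitimate values such as `0` or the string `"0"` are left untouched, so the normalization does not create missing values of its own.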

## Explored Methods for Diagnosing Missing Values

Common methods for handling missing values include:

* Expectation-Maximization
* Imputation methods
* Sensitivity analysis
* Standardizing over the parameters of the column's approximated distribution

## Concluding Actions

Actions that can be concluded at this stage to manipulate the dataset:

* Dropping the rows that contain missing values
* Imputing values, e.g. with the mean or mode, or adding an extra label, e.g. 'UNKNOWN', for multi-label data types
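
The actions above can be sketched in a few lines of standard-library Python. The function names, the `strategy` values and the assumption that missing values have already been normalized to `None` are all illustrative choices, not a fixed API.

```python
from statistics import mean, mode


def drop_missing_rows(rows, column):
    """Keep only the rows (dicts) whose `column` holds a non-missing value."""
    return [row for row in rows if row.get(column) is not None]


def impute(values, strategy="mean", fill_label="UNKNOWN"):
    """Fill None entries using the mean, the mode, or a fixed extra label."""
    if strategy not in {"mean", "mode", "label"}:
        raise ValueError(f"unknown strategy: {strategy}")

    observed = [v for v in values if v is not None]
    if strategy == "label":
        fill = fill_label
    elif not observed:
        raise ValueError("cannot compute a statistic from an empty column")
    elif strategy == "mean":
        fill = mean(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in values]


# Usage on a numeric and a categorical column.
ages = impute([20, None, 40], strategy="mean")
tags = impute(["cat", None, "dog"], strategy="label")
```

Mean and mode imputation shrink the variance of the column, which is one reason the mechanism (MCAR, MAR or MNAR) is worth diagnosing before choosing between these actions.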

#### Resources

{% embed url="<https://pmc.ncbi.nlm.nih.gov/articles/PMC3668100/#sec1>" %}

{% embed url="<https://pmc.ncbi.nlm.nih.gov/articles/PMC3668100/>" %}


