Stage II: Missing Dataset

This stage is a workflow for making decisions on diagnosing missingness at the dataset and column levels.

In real life, missing data is common. Although addressing it should ideally receive ample time and attention, scientists and analysts often face tight deadlines and, consequently, may fall down a rabbit hole or fall short of team and personal goals, especially in the fast-paced environment of startup companies.

Definition: Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. #1

There is no single correct method for handling missing data: the choice depends on the data sample, the modelling problem, the technique used to reach the end objective, and how unclean the data is. This does not make the problem infeasible. The main challenge in automating this workflow is deciding on the optimal method for diagnosing the missing values given these many layers of context.

Using Bayes' probability, the problem of deciding the optimal method can be simplified to the following expression, where $Method_i$, $i = 1, 2, 3, ...$, denotes the $i^{th}$ method being explored in the current session:

$$P(Method_i \mid x_{\text{model problem}} \wedge x_{\text{data domain description}} \wedge x_{\text{data field}})$$

where:

$x_{\text{model problem}}$ : the target objective to achieve from this data sample
$x_{\text{data domain description}}$ : the type or origin of this dataset
$x_{\text{data field}}$ : the data column containing the missing values
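
One way to read this expression, offered as an assumption rather than something the source states, is to expand it with Bayes' rule and select the method with the highest posterior. Here $x$ abbreviates the conjunction of the three context variables, and $\hat{\imath}$ is a symbol introduced only for this illustration:

$$
P(Method_i \mid x) = \frac{P(x \mid Method_i)\, P(Method_i)}{\sum_j P(x \mid Method_j)\, P(Method_j)}, \qquad \hat{\imath} = \arg\max_i P(Method_i \mid x)
$$

Since the denominator is the same for every candidate, ranking the methods only requires the numerator $P(x \mid Method_i)\, P(Method_i)$.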

From a programmer's point of view, the data persisted across states in this decision-making cycle are insights that describe the currently observed sample. These insights are also persisted across the agent's episodes and can be stored in BigQuery Datastore for easy access during data transformations.
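
As a rough illustration of what such a persisted insight might look like, the sketch below builds a small, serializable record for one column. The class name `ColumnMissingnessInsight`, its fields, and the helper `summarize_column` are hypothetical and not taken from the source. Only `None` is treated as missing here, and the storage call itself is deliberately left out.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Sequence


@dataclass(frozen=True)
class ColumnMissingnessInsight:
    """Hypothetical record of what was observed for one column."""
    column: str
    n_rows: int
    n_missing: int
    longest_missing_run: int

    @property
    def missing_ratio(self) -> float:
        # Share of rows that are missing; 0.0 for an empty column.
        return self.n_missing / self.n_rows if self.n_rows else 0.0


def summarize_column(column: str, values: Sequence[Optional[object]]) -> ColumnMissingnessInsight:
    """Count missing entries (None only) and the longest consecutive gap."""
    n_missing = 0
    longest = 0
    current = 0
    for value in values:
        if value is None:
            n_missing += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return ColumnMissingnessInsight(column, len(values), n_missing, longest)


insight = summarize_column("age", [31, None, None, 45, None, 52])
record = asdict(insight)  # plain dict, ready to be written to whichever store is used
```

A frozen dataclass keeps the record immutable between episodes, and `asdict` gives a plain dictionary that can be serialized without extra code.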

| Agent | Purpose | Output Format | Determinism Tactics |
|---|---|---|---|
| MissingMechanismClassifier | LLM classification of mechanism | JSON {mechanism, justification, risk_flags, confidence} | Constrained prompt + fallback parser |
| MissingImpactAssessor | Risk scoring across 4 domains | JSON | Defaults on parse failure |
| MissingStrategyRecommender | Ranked handling strategies | JSON array + rationale | Conservative fallback (mean_imputation) |
| (Heuristic Layer) | Pre-LLM signal extraction | Python dataclass | Deterministic numeric signals |
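
The "fallback parser" tactic for MissingMechanismClassifier could look roughly like the sketch below. It assumes the LLM reply is free text that may or may not contain the JSON object {mechanism, justification, risk_flags, confidence}. The function name `parse_mechanism_reply`, the `UNKNOWN` default, and the clamping of `confidence` to [0, 1] are assumptions made for illustration, not the source's implementation.

```python
import json
import re
from typing import Any, Dict

ALLOWED_MECHANISMS = {"MCAR", "MAR", "MNAR"}


def _default_result() -> Dict[str, Any]:
    # Hypothetical conservative default returned whenever parsing fails.
    return {
        "mechanism": "UNKNOWN",
        "justification": "",
        "risk_flags": ["unparseable_llm_reply"],
        "confidence": 0.0,
    }


def parse_mechanism_reply(reply: str) -> Dict[str, Any]:
    """Extract and validate the JSON object embedded in an LLM reply."""
    # Take everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return _default_result()
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return _default_result()

    mechanism = str(data.get("mechanism", "")).upper()
    if mechanism not in ALLOWED_MECHANISMS:
        return _default_result()

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]

    risk_flags = data.get("risk_flags", [])
    if not isinstance(risk_flags, list):
        risk_flags = []

    return {
        "mechanism": mechanism,
        "justification": str(data.get("justification", "")),
        "risk_flags": [str(flag) for flag in risk_flags],
        "confidence": confidence,
    }


good = parse_mechanism_reply(
    'Sure! {"mechanism": "mar", "justification": "depends on age", '
    '"risk_flags": ["bias"], "confidence": 1.7}'
)
bad = parse_mechanism_reply("I could not decide.")
```

Returning a fixed default instead of raising keeps downstream stages deterministic: they always receive the same four keys, whatever the model replied.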

Types of Missing Dataset

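According to the source, the first stage is distinguishing how the missing values occur in the data sample. Missing values can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The comparisons below list, per data field type, methods for telling these mechanisms apart.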

| Data Field Type | MAR vs MCAR | MNAR vs MAR | MNAR vs MCAR |
|---|---|---|---|
| Numeric (e.g., age, salary) | Little’s MCAR test; correlation tests between missing indicator & observed vars; logistic regression on missingness | Pattern-mixture models; Heckman selection models; sensitivity analysis with simulated unobserved values | Compare Little’s test with value-dependent dropout models; fit selection models predicting missing from true values |
| Binary classification | Chi-square test between missingness indicator & class; logistic regression of missingness ~ class | Logistic regression with observed covariates + EM imputation; examine residual dependence | Randomization/permutation tests vs fitted dropout models; check if imbalance remains after controlling for class |
| Multilabel classification | Missingness indicator regressed on known labels; random forests for feature importance | Pattern-mixture analysis across label sets; semi-supervised EM algorithms | Compare observed-label-predictive models with latent class models predicting missingness |
| Continuous (sensor data) | Time-series autocorrelation of missingness indicator; runs tests; logistic regression on device status | Joint modeling of outcome + missingness process (shared parameter models); survival analysis for dropout | Monte Carlo dropout simulation to test if missing aligns with extreme tails; likelihood ratio tests |
| Long text (speech, transcripts) | Logistic regression: missingness ~ metadata (speaker, duration, file size); clustering missing vs non-missing groups | Latent variable models on semantic embeddings; adversarial ML (train classifier to distinguish missing vs not) | Compare uniform random corruption vs topic-dependent dropout (e.g. TF-IDF clusters more missing) |
| Other text (name, email) | Chi-square test missingness ~ demographic groups; multiple imputation then compare fit metrics | Sensitivity analysis with reweighted likelihoods; Bayesian models with informative priors on unobserved values | Compare uniform random missing (MCAR) with models linking rare/unusual values to missingness (MNAR) |
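
To make one of these checks concrete, here is a minimal sketch of the chi-square test between a missingness indicator and a binary class. The column names `income` and `label`, the function name, and the toy data are hypothetical. The statistic is computed by hand, without a continuity correction, and compared against 3.841, the chi-square critical value for 1 degree of freedom at the 0.05 level. The sketch assumes the column has both missing and observed values and that both classes are present.

```python
import numpy as np
import pandas as pd

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL_DF1_ALPHA05 = 3.841


def missingness_vs_class_chi2(df: pd.DataFrame, column: str, class_column: str) -> float:
    """Pearson chi-square statistic for missingness of `column` vs `class_column`."""
    indicator = df[column].isna()  # True where the value is missing
    observed = pd.crosstab(indicator, df[class_column]).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence of missingness and class.
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


# Toy data: class 0 has 5 of 50 values missing, class 1 has 25 of 50 missing.
values = [np.nan] * 5 + [1.0] * 45 + [np.nan] * 25 + [1.0] * 25
classes = [0] * 50 + [1] * 50
df = pd.DataFrame({"income": values, "label": classes})

stat = missingness_vs_class_chi2(df, "income", "label")
depends_on_class = stat > CHI2_CRITICAL_DF1_ALPHA05
```

A statistic above the critical value is evidence against MCAR, because missingness varies with an observed variable. It does not by itself separate MAR from MNAR.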

Decision-Making Tree in Distinguishing Missing Values

                          ┌───────────────────────────┐
                          │      Start: Missing?      │
                          └─────────────┬─────────────┘
                                        │
                          ┌─────────────┴─────────────┐
                          │                           │
                 ┌────────▼──────────┐       ┌────────▼──────────┐
                 │ Test MCAR: is     │       │ No missing data   │
                 │ missingness       │       │ → Done            │
                 │ completely random?│       └───────────────────┘
                 └─────────┬─────────┘
                           │
         ┌─────────────────┼──────────────────┐
         │                                    │
 ┌───────▼────────┐                 ┌─────────▼─────────┐
 │ MCAR confirmed │                 │ MCAR rejected     │
 │ (Little’s test │                 │ (missingness not  │
 │ not rejected,  │                 │ purely random)    │
 │ no predictors) │                 └─────────┬─────────┘
 └────────────────┘                           │
                                              │
                                 ┌────────────▼──────────────┐
                                 │ Test MAR (Dependence on   │
                                 │ observed variables?)      │
                                 └────────────┬──────────────┘
                                              │
                        ┌─────────────────────┼─────────────────────┐
                        │                                           │
             ┌──────────▼─────────┐                     ┌───────────▼───────────┐
             │ MAR confirmed      │                     │ Residual dependence   │
             │ (missingness can   │                     │ remains after all     │
             │ be explained by    │                     │ observed variables    │
             │ known covariates)  │                     │ tested                │
             └──────────┬─────────┘                     └───────────┬───────────┘
                        │                                           │
             ┌──────────▼─────────┐                     ┌───────────▼───────────┐
             │ Treat as MAR for   │                     │ Suspect MNAR          │
             │ imputation models  │                     │ (missingness depends  │
             │ and ML pipelines   │                     │ on unobserved value   │
             └────────────────────┘                     │ itself; use selection │
                                                        │ models, sensitivity   │
                                                        │ analysis)             │
                                                        └───────────────────────┘
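
The tree above can be written as a small pure function. In this sketch the three boolean inputs stand in for the outcomes of the statistical tests (for example Little's test for `mcar_rejected`). The enum and function names are hypothetical.

```python
from enum import Enum


class Mechanism(Enum):
    NONE = "no missing data"
    MCAR = "MCAR"
    MAR = "MAR"
    MNAR_SUSPECTED = "suspected MNAR"


def classify_missingness(
    has_missing: bool,
    mcar_rejected: bool,
    residual_dependence: bool,
) -> Mechanism:
    """Walk the decision tree from top to bottom.

    has_missing:          the column contains at least one missing value
    mcar_rejected:        the MCAR test was rejected (missingness not purely random)
    residual_dependence:  dependence remains after all observed variables were tested
    """
    if not has_missing:
        return Mechanism.NONE
    if not mcar_rejected:
        return Mechanism.MCAR
    if not residual_dependence:
        return Mechanism.MAR
    return Mechanism.MNAR_SUSPECTED


result = classify_missingness(has_missing=True, mcar_rejected=True, residual_dependence=False)
```

Keeping the tree as plain conditionals makes it usable as a deterministic check on whatever label an LLM proposes.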

Quantifying the Classes Attempt

However, this classification does not consider the context of the observed dataset or of the data column in which the missing values are observed. The following are draft XOR / truth tables based on my understanding of the paper written by Kang. These matrix representations can be used to guide an LLM in decision-making (online learning) processes, or as validation tests / rules.

MCAR

Missing values

Non-Missing Values

Missing Values

Non-Missing values

MAR

Missing values

Non-Missing Values

Missing Values

Non-Missing values

MNAR

Missing values

Non-Missing Values

Missing Values

0.5

Non-Missing values

0.5

Worthy Considerations

In many applications, missing values are not represented in a consistent form. The most common representations in datasets are empty strings, None, np.nan, pd.NA, etc.
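
A minimal sketch of normalizing these representations into a single marker (`np.nan`) before any diagnosis is shown below. The token set `MISSING_TOKENS` and the function name `normalize_missing` are hypothetical, and the extra string tokens such as "n/a" are assumptions to adjust per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical string tokens treated as missing (compared after strip + lowercase).
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def normalize_missing(series: pd.Series) -> pd.Series:
    """Map None, np.nan, pd.NA and missing-like strings to np.nan."""

    def _clean(value):
        if pd.isna(value):  # covers None, np.nan and pd.NA for scalar values
            return np.nan
        if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
            return np.nan
        return value

    return series.map(_clean)


raw = pd.Series(["", None, np.nan, pd.NA, " N/A ", "alice"], dtype=object)
cleaned = normalize_missing(raw)
n_missing = int(cleaned.isna().sum())
```

Running this step first means every later count or test sees the same definition of "missing".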

Explored Methods for Diagnosing Missing values

Common methods for handling missing values include:

Expectation-Maximization
Imputation Methods
Sensitivity analysis
Standardizing over the parameters from the column's approximated distribution

Concluding Actions

Actions to conclude at this stage that manipulate the dataset:

Dropping the rows that contain missing values
Imputation strategies (e.g. mean, mode), or adding an extra label such as 'UNKNOWN' for multi-label data types
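
The actions listed above can be sketched with pandas as follows. The function names, the strategy strings, and the toy columns are hypothetical. The `mode` branch assumes the column has at least one observed value.

```python
import pandas as pd


def drop_missing_rows(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop rows that have a missing value in any of the given columns."""
    return df.dropna(subset=columns)


def impute_column(series: pd.Series, strategy: str) -> pd.Series:
    """Fill missing values using one of three simple strategies."""
    if strategy == "mean":
        return series.fillna(series.mean())
    if strategy == "mode":
        # mode() can return several values; take the first one.
        return series.fillna(series.mode().iloc[0])
    if strategy == "unknown_label":
        return series.fillna("UNKNOWN")
    raise ValueError(f"unsupported strategy: {strategy}")


df = pd.DataFrame(
    {
        "age": [20.0, None, 40.0],
        "city": ["Oslo", "Oslo", None],
        "tag": ["a", None, "b"],
    }
)

age_filled = impute_column(df["age"], "mean")
city_filled = impute_column(df["city"], "mode")
tag_filled = impute_column(df["tag"], "unknown_label")
complete_age_rows = drop_missing_rows(df, ["age"])
```

Which action is appropriate depends on the diagnosed mechanism. Mean or mode imputation and row deletion are easiest to justify when the values are MCAR.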

Resources

#1 Kang. The prevention and handling of the missing data. PubMed Central (PMC).
