Stage II: Missing Dataset

This stage is a workflow for making decisions on diagnosing missingness at the dataset and column levels.

In real life, missing data is common. Ideally, addressing it would receive ample time and attention. In practice, scientists and analysts often face tight deadlines and, as a consequence, may fall down a rabbit hole or fall short of team and personal goals, especially in the fast-paced environment of a startup.

Definition: Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. #1

There is no single correct method for handling missing data: the right choice depends on the data sample, the modelling problem, the technique used to reach the end objective, and how unclean the data is. This does not make the problem infeasible. The main challenge in automating this workflow is deciding on the optimal method for diagnosing the missing values given these many layers of context.

Using Bayes' probability, the problem on deciding the optimal method, $Method_i$ where $i = 1, 2, 3, ...$ defines the $i^{th}$ method being explored in the current session can be simplified to the following expression:

$$P(Method_i | x_{\text{model problem}} \wedge x_{\text{data domain description}} \wedge x_{\text{data field}})$$

where:

  • $x_{\text{model problem}}$ : the target objective to achieve from this data sample

  • $x_{\text{data domain description}}$ : the type or origin of this dataset

  • $x_{\text{data field}}$ : The data column containing the missing values.
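One way to read this expression is as a probability distribution over candidate methods, conditioned on the three context variables. The sketch below estimates it directly from counts rather than through a full Bayesian model. It assumes a hypothetical log of past sessions (`history`) that pairs each context with the method that was chosen; the method names, the records, and the add-alpha smoothing are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical log of past sessions:
# (model problem, data domain description, data field type) -> chosen method.
history = [
    (("regression", "finance", "numeric"), "mean_imputation"),
    (("regression", "finance", "numeric"), "mean_imputation"),
    (("regression", "finance", "numeric"), "em"),
    (("classification", "health", "categorical"), "unknown_label"),
]
methods = ["mean_imputation", "em", "unknown_label"]


def method_posterior(context, history, methods, alpha=1.0):
    """Estimate P(Method_i | context) from counts, with add-alpha smoothing."""
    counts = Counter(m for ctx, m in history if ctx == context)
    total = sum(counts.values()) + alpha * len(methods)
    return {m: (counts[m] + alpha) / total for m in methods}


posterior = method_posterior(("regression", "finance", "numeric"), history, methods)
best = max(posterior, key=posterior.get)  # method with the highest estimated probability
```

The smoothing keeps unseen methods from getting a probability of exactly zero, which matters early on when the log is short.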

From a programmer's point of view, the data persisted across states in this decision-making cycle consists of insights that describe the currently observed sample. These insights are also persisted across the agent's episodes and can be stored in BigQuery Datastore for easy access during data transformations.
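As a rough illustration, the persisted insights could be held in a small record that serializes to JSON, so it can be written to whichever store is used. The class name and all of its fields are hypothetical and are not taken from the actual implementation.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ColumnInsight:
    """Hypothetical record of what is known about one column so far."""
    column: str
    missing_rate: float
    suspected_mechanism: str = "unknown"  # e.g. "MCAR", "MAR", "MNAR"
    notes: list = field(default_factory=list)


def to_json(insight: ColumnInsight) -> str:
    """Serialize the record so it can be persisted between episodes."""
    return json.dumps(asdict(insight), sort_keys=True)


def from_json(payload: str) -> ColumnInsight:
    """Rebuild the record from its JSON form."""
    return ColumnInsight(**json.loads(payload))


insight = ColumnInsight(column="salary", missing_rate=0.12)
insight.notes.append("missingness concentrated in recent rows")
restored = from_json(to_json(insight))
```

A flat, JSON-friendly record keeps the state readable by both the agents and any downstream transformation step.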

| Agent | Purpose | Output Format | Determinism Tactics |
| --- | --- | --- | --- |
| MissingMechanismClassifier | LLM classification of mechanism | JSON {mechanism, justification, risk_flags, confidence} | Constrained prompt + fallback parser |
| MissingImpactAssessor | Risk scoring across 4 domains | JSON | Defaults on parse failure |
| MissingStrategyRecommender | Ranked handling strategies | JSON array + rationale | Conservative fallback (mean_imputation) |
| (Heuristic Layer) | Pre-LLM signal extraction | Python dataclass | Deterministic numeric signals |
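The "fallback parser" and "defaults on parse failure" tactics can be sketched roughly as follows. This assumes the LLM reply is a string that may wrap a JSON object in extra text and that the keys follow the `{mechanism, justification, risk_flags, confidence}` format listed above. The function name, the fallback values, and the `unparsed_llm_output` flag are hypothetical.

```python
import json
import re

VALID_MECHANISMS = {"MCAR", "MAR", "MNAR"}


def _fallback() -> dict:
    """Hypothetical conservative result used when the reply cannot be trusted."""
    return {
        "mechanism": "unknown",
        "justification": "parse failure",
        "risk_flags": ["unparsed_llm_output"],
        "confidence": 0.0,
    }


def parse_mechanism_reply(reply: str) -> dict:
    """Parse the outermost brace-delimited span of a reply, or fall back to defaults."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return _fallback()
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return _fallback()
    mechanism = data.get("mechanism")
    if not isinstance(mechanism, str) or mechanism not in VALID_MECHANISMS:
        return _fallback()
    flags = data.get("risk_flags")
    try:
        # Clamp confidence to [0, 1] so downstream scoring stays bounded.
        confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    except (TypeError, ValueError):
        confidence = 0.0
    return {
        "mechanism": mechanism,
        "justification": str(data.get("justification", "")),
        "risk_flags": flags if isinstance(flags, list) else [],
        "confidence": confidence,
    }


ok = parse_mechanism_reply('Sure! {"mechanism": "MAR", "justification": "depends on age", "confidence": 1.4}')
bad = parse_mechanism_reply("I think it is probably MAR.")
```

Returning a complete dictionary in every branch means callers never need to check for missing keys, which is what keeps the pipeline deterministic when the model output is malformed.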

Types of Missing Dataset

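According to the source, the first stage is to distinguish how the missing values occur in the data sample. Missing values can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The tests below help separate each pair of mechanisms, by data field type: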

| Data Field Type | MAR vs MCAR | MNAR vs MAR | MNAR vs MCAR |
| --- | --- | --- | --- |
| Numeric (e.g., age, salary) | Little’s MCAR test; correlation tests between missing indicator & observed vars; logistic regression on missingness | Pattern-mixture models; Heckman selection models; sensitivity analysis with simulated unobserved values | Compare Little’s test with value-dependent dropout models; fit selection models predicting missing from true values |
| Binary classification | Chi-square test between missingness indicator & class; logistic regression of missingness ~ class | Logistic regression with observed covariates + EM imputation; examine residual dependence | Randomization/permutation tests vs fitted dropout models; check if imbalance remains after controlling for class |
| Multilabel classification | Missingness indicator regressed on known labels; random forests for feature importance | Pattern-mixture analysis across label sets; semi-supervised EM algorithms | Compare observed-label-predictive models with latent class models predicting missingness |
| Continuous (sensor data) | Time-series autocorrelation of missingness indicator; runs tests; logistic regression on device status | Joint modeling of outcome + missingness process (shared parameter models); survival analysis for dropout | Monte Carlo dropout simulation to test if missing aligns with extreme tails; likelihood ratio tests |
| Long text (speech, transcripts) | Logistic regression: missingness ~ metadata (speaker, duration, file size); clustering missing vs non-missing groups | Latent variable models on semantic embeddings; adversarial ML (train classifier to distinguish missing vs not) | Compare uniform random corruption vs topic-dependent dropout (e.g. TF-IDF clusters more missing) |
| Other text (name, email) | Chi-square test missingness ~ demographic groups; multiple imputation then compare fit metrics | Sensitivity analysis with reweighted likelihoods; Bayesian models with informative priors on unobserved values | Compare uniform random missing (MCAR) with models linking rare/unusual values to missingness (MNAR) |
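To make one cell of this table concrete, the chi-square test between a missingness indicator and a binary class can be computed by hand on a 2x2 table. The sketch below uses invented counts and no continuity correction (an assumption). The constant 3.841 is the chi-square critical value for one degree of freedom at the 5% level.

```python
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def chi_square_2x2(rows):
    """Pearson chi-square statistic for (is_missing, class_label) pairs."""
    # Observed counts: table[m][c], m = 1 if the value is missing, c = class 0 or 1.
    table = [[0, 0], [0, 0]]
    for is_missing, label in rows:
        table[int(is_missing)][int(label)] += 1
    n = sum(sum(r) for r in table)
    row_tot = [sum(r) for r in table]
    col_tot = [table[0][c] + table[1][c] for c in range(2)]
    if 0 in row_tot or 0 in col_tot:
        raise ValueError("each row and column of the table needs at least one observation")
    stat = 0.0
    for m in range(2):
        for c in range(2):
            expected = row_tot[m] * col_tot[c] / n
            stat += (table[m][c] - expected) ** 2 / expected
    return stat


# Invented sample: 10 of 100 values missing in class 0, 40 of 100 missing in class 1.
rows = [(False, 0)] * 90 + [(False, 1)] * 60 + [(True, 0)] * 10 + [(True, 1)] * 40
stat = chi_square_2x2(rows)
depends_on_class = stat > CRITICAL_5PCT_DF1
```

A statistic above the critical value is evidence against MCAR, since missingness then varies with an observed variable. On its own it cannot separate MAR from MNAR, which is why the table lists different techniques for that comparison.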

Decision-Making Tree in Distinguishing Missing Values

Quantifying the Classes Attempt

However, this classification doesn't consider the context of the observed dataset or of the data column in which the missing values are observed. The following are draft XOR / truth tables based on my understanding of the paper written by Kang. These matrix representations can be used to guide an LLM in decision-making (online learning) processes, or as validation tests / rules.

MCAR

| | Missing values | Non-Missing values |
| --- | --- | --- |
| Missing values | 0 | 0 |
| Non-Missing values | 0 | 0 |

MAR

| | Missing values | Non-Missing values |
| --- | --- | --- |
| Missing values | 0 | 1 |
| Non-Missing values | 0 | 0 |

MNAR

| | Missing values | Non-Missing values |
| --- | --- | --- |
| Missing values | 0.5 | 0.5 |
| Non-Missing values | 0.5 | 0.5 |
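One possible way to use these matrices as validation rules is to encode them as lookup tables and match an estimated matrix to the closest one. The sketch below does that with a squared-distance comparison. Reading the cells as dependence strengths between 0 and 1, and choosing squared distance as the measure, are both assumptions made for illustration. The function name is hypothetical.

```python
# The three 2x2 tables, with rows and columns ordered (missing, non-missing).
MECHANISM_TABLES = {
    "MCAR": ((0.0, 0.0), (0.0, 0.0)),
    "MAR": ((0.0, 1.0), (0.0, 0.0)),
    "MNAR": ((0.5, 0.5), (0.5, 0.5)),
}


def closest_mechanism(observed):
    """Return the mechanism whose table has the smallest squared distance to `observed`."""
    def distance(table):
        return sum(
            (table[r][c] - observed[r][c]) ** 2
            for r in range(2)
            for c in range(2)
        )
    return min(MECHANISM_TABLES, key=lambda name: distance(MECHANISM_TABLES[name]))


# Hypothetical estimate in which missingness mostly tracks the observed values.
label = closest_mechanism(((0.0, 0.9), (0.0, 0.1)))
```

The same lookup can serve as a rule check: if an LLM proposes a mechanism that differs from the closest table, the disagreement can be flagged for review.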

Worthy Considerations

In many applications, missing values may not be represented in a consistent form. The most common representations in datasets are empty strings, pd.NA, None, np.nan, etc.
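A minimal sketch of how these representations could be collapsed into a single form before diagnosis is shown below, using only the standard library. The sentinel strings in `MISSING_TOKENS` are hypothetical and would need to be adjusted per dataset. `np.nan` is an ordinary Python float, so the NaN check covers it; pandas-specific markers such as `pd.NA` are not handled here.

```python
import math

# Hypothetical sentinel strings, compared after stripping and lower-casing.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-"}


def is_missing(value) -> bool:
    """True for None, float NaN, and strings that are blank or a known sentinel."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return True
    return False


def normalize_column(values):
    """Replace every missing representation with None so later steps see one form."""
    return [None if is_missing(v) else v for v in values]


raw = [42, "", None, float("nan"), "  N/A ", "alice", 0]
clean = normalize_column(raw)
```

Note that legitimate values such as `0` are left alone; only the explicitly listed representations count as missing.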

Explored Methods for Diagnosing Missing values

The common methods in handling missing data values include:

  • Expectation-Maximization

  • Imputation Methods

  • Sensitivity analysis

  • Standardizing over the parameters from the column's approximated distribution
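As an illustration of the sensitivity-analysis item, the sketch below imputes each missing value as the observed mean shifted by an offset `delta` and reports how the column mean moves. This delta-adjustment approach is one common form of sensitivity analysis and is not necessarily the one used in this workflow. The data and the offsets are invented.

```python
from statistics import mean


def mean_under_shift(values, delta):
    """Column mean when each missing value (None) is imputed as observed mean + delta."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) + delta
    return mean(fill if v is None else v for v in values)


salaries = [40.0, 50.0, None, 60.0, None]
# delta = 0 corresponds to plain mean imputation; non-zero deltas mimic MNAR-style bias.
results = {delta: mean_under_shift(salaries, delta) for delta in (-10.0, 0.0, 10.0)}
```

If the downstream conclusion changes materially across plausible offsets, the analysis is sensitive to the missingness assumption and a simple imputation is risky.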

Concluding Actions

Actions to be concluded at this stage that manipulate the dataset:

  • Dropping the rows that contain missing values

  • Imputation strategies, e.g. mean or mode, or adding an extra label such as 'UNKNOWN' for multi-label data types.
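The two actions can be sketched on a list of row dictionaries as follows. This assumes missing values have already been normalized to `None`. The function names and the strategy keys (`"mean"`, `"mode"`, `"unknown_label"`) are hypothetical.

```python
from statistics import mean, mode


def drop_missing_rows(rows, column):
    """Keep only rows where `column` is present (not None)."""
    return [row for row in rows if row.get(column) is not None]


def impute(rows, column, strategy):
    """Return new rows with None in `column` filled by mean, mode, or an 'UNKNOWN' label."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    if strategy == "unknown_label":
        fill = "UNKNOWN"
    elif strategy == "mean":
        fill = mean(observed)  # raises StatisticsError if nothing was observed
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]


rows = [
    {"age": 30, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": 50, "city": None},
]
dropped = drop_missing_rows(rows, "age")
mean_filled = impute(rows, "age", "mean")
labelled = impute(rows, "city", "unknown_label")
```

Both functions return new lists and leave the input untouched, so the original sample stays available for comparing strategies.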

Resources
