Stage II: Missing Dataset
This stage is a workflow for making decisions on diagnosing missingness at the dataset and column levels.
In real life, missing data is common. Although addressing it should ideally receive ample time and attention, scientists and analysts often face tight deadlines and, consequently, may fall down a rabbit hole or fail to meet team and personal goals, especially in the fast-paced environment of startup companies.
Definition: Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. #1
There is no single correct method for handling missing data: the right choice depends on the data sample, the modelling problem, the technique used to reach the end objective, and how unclean the data is. This does not make the problem infeasible, though. The main challenge in automating this workflow is determining the optimal method for diagnosing the missing values given these many layers of context.
Using Bayes' probability, the problem of deciding the optimal method, Method_i, where i = 1, 2, 3, ... indexes the i-th method being explored in the current session, can be simplified to the following expression:
where:
x_{model problem}: the target objective to achieve from this data sample
x_{data domain description}: the type or origin of this dataset
x_{data field}: the data column containing the missing values
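One plausible form of this expression, assuming the three inputs act as jointly observed evidence and the candidate methods are mutually exclusive, is Bayes' rule applied to each method. The prior P(Method_i) and the likelihood term are assumptions of this sketch, not quantities defined in the text:

```latex
P\left(\text{Method}_i \mid x_{\text{model problem}},\, x_{\text{data domain description}},\, x_{\text{data field}}\right)
= \frac{P\left(x_{\text{model problem}},\, x_{\text{data domain description}},\, x_{\text{data field}} \mid \text{Method}_i\right) P\left(\text{Method}_i\right)}
       {\sum_{j} P\left(x_{\text{model problem}},\, x_{\text{data domain description}},\, x_{\text{data field}} \mid \text{Method}_j\right) P\left(\text{Method}_j\right)}
```

Under this reading, the optimal method is the Method_i with the highest posterior probability.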
From a programmer's point of view, the data persisted across states in this decision-making cycle are insights that describe the currently observed sample. These insights are also persisted across the agent's episodes and can be stored in BigQuery Datastore for easy access during data transformations.
| Component | Purpose | Output | Robustness |
|---|---|---|---|
| MissingMechanismClassifier | LLM classification of mechanism | JSON {mechanism, justification, risk_flags, confidence} | Constrained prompt + fallback parser |
| MissingImpactAssessor | Risk scoring across 4 domains | JSON | Defaults on parse failure |
| MissingStrategyRecommender | Ranked handling strategies | JSON array + rationale | Conservative fallback (mean_imputation) |
| (Heuristic Layer) | Pre-LLM signal extraction | Python dataclass | Deterministic numeric signals |
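As a minimal sketch of two of these pieces, the code below shows a heuristic-layer dataclass and a fallback parser for the classifier's JSON reply. The JSON keys (mechanism, justification, risk_flags, confidence) come from the component description; the signal fields, the `extract_signals` and `parse_mechanism_reply` helpers, and the "UNKNOWN" default are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_MECHANISMS = {"MCAR", "MAR", "MNAR"}


@dataclass
class MissingSignals:
    """Deterministic numeric signals computed before any LLM call (assumed fields)."""
    n_rows: int
    n_missing: int
    missing_rate: float
    longest_missing_run: int


def extract_signals(values):
    """Compute signals for one column, treating None as missing."""
    n_missing = 0
    longest = current = 0
    for v in values:
        if v is None:
            n_missing += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    n_rows = len(values)
    rate = n_missing / n_rows if n_rows else 0.0
    return MissingSignals(n_rows, n_missing, rate, longest)


def parse_mechanism_reply(raw):
    """Parse the classifier's JSON reply; return safe defaults on any failure."""
    default = {
        "mechanism": "UNKNOWN",  # assumed sentinel, not part of the source
        "justification": "",
        "risk_flags": ["parse_failure"],
        "confidence": 0.0,
    }
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return default
    if not isinstance(data, dict):
        return default
    mechanism = data.get("mechanism")
    if not isinstance(mechanism, str) or mechanism not in ALLOWED_MECHANISMS:
        return default
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    flags = data.get("risk_flags")
    return {
        "mechanism": mechanism,
        "justification": str(data.get("justification", "")),
        "risk_flags": flags if isinstance(flags, list) else [],
        "confidence": confidence,
    }
```

Keeping the numeric signals deterministic means the same column always produces the same prompt inputs, while the parser ensures a malformed LLM reply degrades to a flagged default instead of raising.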
Types of Missing Dataset
According to the source, the first step is to distinguish how the missing values occur in the data sample. Missing values can occur completely at random (MCAR), at random (MAR), or not at random (MNAR).
| Data type | Tests against observed variables | Models for dependence on unobserved values | Comparisons of random vs. value-dependent missingness |
|---|---|---|---|
| Numeric (e.g., age, salary) | Little’s MCAR test; correlation tests between missing indicator & observed vars; logistic regression on missingness | Pattern-mixture models; Heckman selection models; sensitivity analysis with simulated unobserved values | Compare Little’s test with value-dependent dropout models; fit selection models predicting missing from true values |
| Binary classification | Chi-square test between missingness indicator & class; logistic regression of missingness ~ class | Logistic regression with observed covariates + EM imputation; examine residual dependence | Randomization/permutation tests vs fitted dropout models; check if imbalance remains after controlling for class |
| Multilabel classification | Missingness indicator regressed on known labels; random forests for feature importance | Pattern-mixture analysis across label sets; semi-supervised EM algorithms | Compare observed-label-predictive models with latent class models predicting missingness |
| Continuous (sensor data) | Time-series autocorrelation of missingness indicator; runs tests; logistic regression on device status | Joint modeling of outcome + missingness process (shared parameter models); survival analysis for dropout | Monte Carlo dropout simulation to test if missing aligns with extreme tails; likelihood ratio tests |
| Long text (speech, transcripts) | Logistic regression: missingness ~ metadata (speaker, duration, file size); clustering missing vs non-missing groups | Latent variable models on semantic embeddings; adversarial ML (train classifier to distinguish missing vs not) | Compare uniform random corruption vs topic-dependent dropout (e.g. TF-IDF clusters more missing) |
| Other text (name, email) | Chi-square test missingness ~ demographic groups; multiple imputation then compare fit metrics | Sensitivity analysis with reweighted likelihoods; Bayesian models with informative priors on unobserved values | Compare uniform random missing (MCAR) with models linking rare/unusual values to missingness (MNAR) |
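To illustrate one of the listed checks, here is a minimal sketch of the chi-square test between a missingness indicator and a binary class. It uses only the standard library, treats None as missing, and applies no continuity correction; the function name and the None convention are assumptions of this example.

```python
import math


def chi_square_missing_vs_class(values, labels):
    """Chi-square test of independence between 'is missing' and a binary class.

    Returns (statistic, p_value) for a 2x2 table with 1 degree of freedom.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")

    # counts[i][j]: i = 0 observed / 1 missing, j = index of the class
    counts = [[0, 0], [0, 0]]
    for v, y in zip(values, labels):
        counts[int(v is None)][classes.index(y)] += 1

    n = sum(sum(r) for r in counts)
    row = [sum(r) for r in counts]
    col = [counts[0][j] + counts[1][j] for j in range(2)]
    if 0 in row or 0 in col:
        raise ValueError("need both missing and observed values, and both classes")

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (counts[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A small p-value suggests that missingness depends on the class, which argues against MCAR; it cannot by itself separate MAR from MNAR, since that distinction depends on values that were never observed.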
Decision-Making Tree in Distinguishing Missing Values
Quantifying the Classes Attempt
However, this classification does not consider the context of the observed dataset or of the data column where the missing values are observed. The following are draft XOR / truth tables based on my understanding of the paper written by Kang. These matrix representations can be used to guide an LLM in decision-making (online learning) processes, or as validation tests / rules.
Worthy Considerations
In many applications, missing values are not represented in a consistent form. The most common representations in datasets are empty strings, pd.NA, None, np.nan, etc.
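A minimal sketch of normalizing these representations to a single marker (None) before any diagnosis is shown below. It uses only the standard library; the token list and function names are hypothetical and should be adapted per dataset. It does not detect pd.NA, for which pandas' own `pd.isna` would be the natural check.

```python
import math

# Hypothetical list of string tokens that stand for "missing"; extend per dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """Return True if the value is one of the recognized missing forms."""
    if value is None:
        return True
    # np.nan is an ordinary Python float, so this also covers it.
    if isinstance(value, float) and math.isnan(value):
        return True
    # Empty, whitespace-only, or placeholder strings.
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return True
    return False


def normalize_missing(values):
    """Map every recognized missing form to None, leaving other values untouched."""
    return [None if is_missing(v) else v for v in values]
```

For example, `normalize_missing(["", None, float("nan"), "N/A", "a", 0])` maps the first four entries to None and keeps "a" and 0, so legitimate zeros are not mistaken for missing values.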
Explored Methods for Diagnosing Missing Values
The common methods for handling missing data values include:
Expectation-Maximization
Imputation Methods
Sensitivity analysis
Standardizing over the parameters from the column's approximated distribution
Concluding Actions
Actions to be concluded at this stage, related to manipulating the dataset:
Dropping the rows that contain missing values
Imputation strategies, e.g. mean or mode, or adding an extra label, e.g. 'UNKNOWN', for multi-label data types
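The actions above can be sketched with the standard library as follows. This is an illustrative outline that assumes missing values have already been normalized to None; the helper names are hypothetical.

```python
from statistics import mean, mode


def drop_missing_rows(rows, column):
    """Keep only the rows (dicts) whose value in `column` is present."""
    return [r for r in rows if r.get(column) is not None]


def impute_mean(values):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing values with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    fill = mode(observed)
    return [fill if v is None else v for v in values]


def label_unknown(values, label="UNKNOWN"):
    """Replace missing categorical values with an explicit extra label."""
    return [label if v is None else v for v in values]
```

Mean imputation keeps the column mean but shrinks its variance, so it is best treated as a conservative default rather than a general answer; the explicit 'UNKNOWN' label preserves the fact that a value was missing.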
Resources