MIT Proposes Novel End-to-End Procedure for Corrupted Data Cleaning, Estimation, and Inference

A research team from MIT proposes a unified framework for estimation and inference in the presence of various forms of economic data corruption such as measurement errors, missing values, discretization, and differential privacy.

Economic data is imperfect, and even the most carefully curated datasets can have noisy, missing, discretized, or privatized variables. Moreover, today’s standard data cleaning procedures often fail to consider the bias and variance consequences of data cleaning, which may mislead efforts at causal inference.

To address these issues, the team treats measurement error, missing values, discretization, and differential privacy within a single framework for estimation and inference. They also introduce an end-to-end procedure that automates data cleaning, and they provide new nonasymptotic theoretical advances for each stage of the procedure.

The study frames its goal as estimation and inference for a causal parameter that is a scalar functional of a nonparametric regression, such as a treatment effect, policy effect, or elasticity. Previous research in this area has used semiparametric theory to study functionals of nonparametric regressions and densities without data corruption. A key insight from classic semiparametric theory is that functionals of interest typically have a Riesz representer: by the Riesz representation theorem, a continuous linear functional on a Hilbert space can be written as an inner product with a fixed element of that space. In this work, the team casts data cleaning and nonparametric Riesz representer estimation as an error-in-variables problem, and extends debiased machine learning theory to the corrupted data setting.

The team explains that their simplified, automated data cleaning procedure comprises three steps: fill in missing values as zeroes, scale appropriately, then perform principal component analysis (PCA). They also extend the standard data cleaning procedure in four ways: 1) They allow different variables to be missing with different probabilities; 2) They allow dependence of missingness within a given row; 3) They allow for technical variables constructed as transformations of original variables; 4) They introduce out-of-sample filling of missing values, which facilitates the cross-fitting required for bias correction and online learning.
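The three cleaning steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `clean`, the choice to scale each column by the inverse of its observed fraction, and the use of an uncentered truncated SVD as the PCA step are all assumptions made for the example.

```python
import numpy as np

def clean(Z, rank):
    """Zero-fill, rescale, then keep the top `rank` principal components.

    Hypothetical helper: missing entries of Z are encoded as NaN.
    """
    Z = np.asarray(Z, dtype=float)
    observed = ~np.isnan(Z)
    # Column-wise observation rates, so each variable may be missing with
    # its own probability. The 1/n floor avoids division by zero when a
    # column is entirely missing.
    p_hat = np.maximum(observed.mean(axis=0), 1.0 / Z.shape[0])
    # Step 1 and 2: fill missing values with zeroes, then scale.
    filled = np.where(observed, Z, 0.0) / p_hat
    # Step 3: PCA, done here as a truncated SVD without centering.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Usage on simulated data: a rank-2 signal with noise and 20% missingness.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
Z = X + 0.1 * rng.normal(size=X.shape)
Z[rng.random(Z.shape) < 0.2] = np.nan
X_hat = clean(Z, rank=2)
```

The rank is passed in by hand here; how to choose it is outside the scope of this sketch.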

After the data cleaning procedure, the team turns their attention to error-in-variables regression, which also includes three simple steps: cleaning the training set, performing ordinary least squares (OLS) on the cleaned training set, and using this OLS coefficient on the filled test set for prediction.
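A rough sketch of these three steps follows, again in NumPy. The function names `fill`, `fit_eiv_regression`, and `predict` are hypothetical, and the sketch assumes that the test set is filled using the observation rates estimated on the training set. The OLS coefficient on the rank-reduced training matrix is taken to be the minimum-norm solution, computed directly from the truncated SVD.

```python
import numpy as np

def fill(Z, p_hat):
    """Replace NaNs with zeroes and rescale columns by 1 / p_hat."""
    Z = np.asarray(Z, dtype=float)
    return np.where(np.isnan(Z), 0.0, Z) / p_hat

def fit_eiv_regression(Z_train, y_train, rank):
    """Clean the training set, then run OLS on the cleaned regressors."""
    Z_train = np.asarray(Z_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    n = Z_train.shape[0]
    p_hat = np.maximum((~np.isnan(Z_train)).mean(axis=0), 1.0 / n)
    U, s, Vt = np.linalg.svd(fill(Z_train, p_hat), full_matrices=False)
    # Minimum-norm OLS coefficient on the rank-`rank` cleaned matrix:
    # beta = V_k diag(1 / s_k) U_k^T y.
    beta = Vt[:rank].T @ ((U[:, :rank].T @ y_train) / s[:rank])
    return beta, p_hat

def predict(Z_test, beta, p_hat):
    """Apply the training coefficient to the filled (not re-cleaned) test set."""
    return fill(Z_test, p_hat) @ beta
```

Because the minimum-norm coefficient lies in the span of the leading principal directions of the training set, multiplying filled test rows by it implicitly projects them onto those directions, so no separate PCA pass over the test set is needed in this sketch.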

The researchers also propose an error-in-variables Riesz representer procedure: clean the training set, perform minimum distance estimation (MDE) on the cleaned training set, then use this MDE coefficient on the filled test set for prediction.
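To make the minimum distance step concrete, the sketch below works through one special case. It rests on several assumptions that are not stated in the article: the target is an average treatment effect with binary treatment `d`, the covariates `X` have already been cleaned, the Riesz representer is modeled as linear in a hypothetical dictionary of treatment-by-covariate interactions, and the coefficient minimizes the quadratic objective rho' G rho - 2 M' rho familiar from the automatic debiased machine learning literature. The paper's exact estimator may differ.

```python
import numpy as np

def dictionary(d, X):
    """Hypothetical dictionary b(d, x): [1, x] interacted with d and with 1 - d."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    F = np.column_stack([np.ones(len(d)), X])
    return np.hstack([d[:, None] * F, (1.0 - d)[:, None] * F])

def riesz_mde(d, X):
    """Coefficient rho of a linear Riesz representer alpha(d, x) = b(d, x)' rho.

    Minimizes rho' G rho - 2 M' rho, i.e. solves G rho = M, where
    G = mean of b b' and M = mean of b(1, x) - b(0, x).
    """
    n = len(d)
    b_obs = dictionary(d, X)
    b_diff = dictionary(np.ones(n), X) - dictionary(np.zeros(n), X)
    G = b_obs.T @ b_obs / n
    M = b_diff.mean(axis=0)
    # lstsq returns the minimum-norm solution if G is rank deficient.
    rho, *_ = np.linalg.lstsq(G, M, rcond=1e-10)
    return rho
```

With no covariates, the dictionary reduces to the columns d and 1 - d, and the fitted representer is d / mean(d) - (1 - d) / mean(1 - d), the familiar inverse propensity weights for a randomized treatment.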

The paper also provides new, nonasymptotic theoretical advances in each stage of the procedure, which the team summarizes as: 1) In data cleaning, generalize matrix completion guarantees to the setting in which different variables may be missing with different probabilities; 2) In error-in-variables regression and error-in-variables bias/variance correction, prove fast average mean square error rates under the assumption that the true regressors are approximately low rank; 3) In target parameter analysis, generalize semiparametric inference guarantees to prove root-n consistency, Gaussian approximation, and semiparametric efficiency; 4) Verify the approximately low rank assumption for a broad class of generalized factor models.

Overall, the study addresses causal inference with data that may be noisy, missing, discretized or privatized with a novel end-to-end procedure that involves data cleaning via matrix completion, estimation via new variants of principal component regression, and inference via doubly robust moments.
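As an illustration of the final inference step, the sketch below evaluates a doubly robust moment for an average treatment effect and builds a 95% confidence interval from the Gaussian approximation. It assumes that the regression predictions and Riesz representer values have already been computed, ideally on held-out folds; the function name and argument names are hypothetical, and cross-fitting is omitted for brevity.

```python
import numpy as np

def doubly_robust(y, gamma_obs, gamma_treated, gamma_control, alpha):
    """Doubly robust estimate of an average treatment effect.

    gamma_obs:      regression predictions at the observed treatment
    gamma_treated:  regression predictions with treatment set to 1
    gamma_control:  regression predictions with treatment set to 0
    alpha:          Riesz representer evaluated at each observation
    """
    y, gamma_obs, gamma_treated, gamma_control, alpha = (
        np.asarray(a, dtype=float)
        for a in (y, gamma_obs, gamma_treated, gamma_control, alpha)
    )
    # Plug-in effect plus a bias correction weighted by the representer.
    psi = gamma_treated - gamma_control + alpha * (y - gamma_obs)
    theta = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return theta, se, (theta - 1.96 * se, theta + 1.96 * se)
```

The moment is called doubly robust because the correction term vanishes on average when the regression is right, while the estimate reduces to a weighting estimator when the regression is set to zero and the representer is right.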

For future research, the team suggests exploring how to extend their approach to settings with confounded noise and sample selection bias, and their paper provides a template for doing so.

The paper Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy is on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang
