Machine learning to study causality with big datasets: towards methods yielding valid statistical conclusions

Research project

In this project we focus on measuring uncertainty when machine learning is used to estimate causal relationships from big databases. Our overall purpose is to develop novel methods that yield valid statistical conclusions on causal effects when using machine learning algorithms. The project will provide important new tools that allow social and health scientists to obtain valid statistical conclusions when conducting large-scale observational studies of causal pathways.

The project outputs will include valid measures of uncertainty, typically confidence intervals for causal effects, for specific causal machine learning algorithms (e.g. causal effects estimated using neural networks or random forests). We will also provide measures of uncertainty that take into account the potential existence of unobserved confounders in high-dimensional settings. Finally, by studying efficiency bounds for different identification strategies, we will develop causal machine learning methods that are optimal yet uniformly valid.

Head of project

Xavier de Luna, Professor

+46 90 786 55 59

Project overview

Project period:

2022-06-01 – 2026-12-31

Funding

6’000’000 SEK, Marianne and Marcus Wallenberg Stiftelse, PI: Xavier de Luna

Participating departments and units at Umeå University

Faculty of Social Sciences, Umeå School of Business, Economics and Statistics

Research area

Statistics

Project members

Tetiana Gorbach, Associate professor

+46 90 786 72 01

Per Gustafsson, Professor

+46 90 786 95 63

Mohammad Ghasempour, Postdoctoral fellow

+46 90 786 54 57

External project members

Professor Yanyuan Ma, Penn State University

Project description

Large-scale observational databases built by linking socioeconomic and health registers cover millions of individuals, with as many features for each individual, including not only their own life histories but also those of their household members, other family members, work colleagues, neighbours, etc. Such data infrastructures provide unique opportunities to discover causal pathways and test related scientific hypotheses, thereby yielding results with policy relevance.

Machine learning methods are essential for dealing with the high dimensionality and complexity of these large databases. While extensive research has been carried out, and is still ongoing, on machine learning for automation, estimation and prediction, much less effort has gone into measuring the uncertainty in the outputs of machine learning algorithms. This is true, in particular, when the output is an estimate of a causal effect. Without measures of uncertainty (e.g., in the form of confidence intervals), it is not possible to draw relevant statistical conclusions on the existence and size of causal effects from observational databases.

In this project we have gathered the expertise needed to fill major gaps in this area, with a focus on measuring uncertainty when machine learning is used to discover and estimate causal relationships in big databases. The overall purpose of the project is to develop novel methods that yield valid statistical conclusions (inference) on causal effects when using machine learning algorithms and big datasets.

This project will provide important new tools that will allow social and health scientists to obtain valid statistical conclusions when conducting large-scale observational studies of causal pathways.

Latest update: 2022-11-07