Methods for improving estimation of causal effects in observational studies

Research project
The Nordic countries, thanks to their administrative and health registers (and the possibility of linking them for research purposes), have an infrastructure for observational studies that is unique in the world. The purpose of such studies is often to evaluate the effect of a causal variable on an outcome of interest, such as the effect of a labor market program on time in unemployment or the effect of adult education on time of retirement.

In such studies there is a need to control for background variables (e.g., education, income, gender) that affect both the causal agent and the outcome of interest. In this context, nonparametric methods are increasingly used because they require few model assumptions. This flexibility is valuable when many individuals are observed, which is common in observational studies. The purpose of this project is to develop and study methods for selecting background variables, and to study the properties of nonparametric estimators of causal parameters when this selection and the other necessary model choices are taken into account. The project results are expected to be relevant for researchers involved in register-based research, and the methods are intended to be implemented in freely available software, making flexible estimation of causal effects more accessible to practitioners.

In practice one always has more than one covariate, and to ensure that unconfoundedness holds we typically want to control for a large number of covariates, i.e., we have a high-dimensional covariate vector. This, however, poses a problem when using smoothers, known as the curse of dimensionality. A solution, in this context, would be to replace the covariate vector with the true propensity score. We would then get a simple estimator of the average causal effect (ACE) by taking the mean difference of two nonparametric regression fits that use the propensity score as the only covariate. However, in practice the propensity score also needs to be estimated (since it is unknown when treatment is not randomized), and this in turn requires reducing the dimension of the covariate vector. Thus, to use such an estimator of the ACE in practice, several choices have to be made: which covariates to include in the propensity score model; how the propensity score should be estimated; and, for any smoother used in the chosen ACE estimator, how to select the two smoothing parameters. To broaden the practical use of the method we plan to study the properties of the ACE estimator when selection of covariates, propensity score estimation and smoothing parameter selection are all taken into account.
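As an illustration of the estimator described above, the following is a minimal sketch in Python, not the project's implementation. It assumes the true propensity score is known, uses a Gaussian-kernel Nadaraya-Watson smoother as one possible choice of smoother, and fixes both smoothing parameters by hand. The function names (nw_fit, ace_estimate), the simulated data and the bandwidth values are all hypothetical.

```python
import numpy as np


def nw_fit(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson regression (Gaussian kernel) evaluated at x_eval."""
    # Rows index evaluation points, columns index training points.
    u = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * u ** 2)
    return weights @ y_train / weights.sum(axis=1)


def ace_estimate(y, t, score, h_treated, h_control):
    """Mean difference of two smooths of y on the propensity score."""
    treated = t == 1
    # One regression fit per treatment group, each with its own bandwidth,
    # both evaluated at the scores of all individuals in the sample.
    m1 = nw_fit(score[treated], y[treated], score, h_treated)
    m0 = nw_fit(score[~treated], y[~treated], score, h_control)
    return float(np.mean(m1 - m0))


# Simulated data (an assumption for illustration): two covariates,
# treatment depends on both, the outcome depends on the first one,
# and the true ACE is 2.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))
true_score = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
t = rng.binomial(1, true_score)
y = 2.0 * t + x[:, 0] + rng.normal(size=n)

# Unadjusted comparison of group means, confounded by the first covariate.
naive = float(np.mean(y[t == 1]) - np.mean(y[t == 0]))
ace_hat = ace_estimate(y, t, true_score, h_treated=0.05, h_control=0.05)
print(f"naive difference: {naive:.2f}, propensity-score estimate: {ace_hat:.2f}")
```

The two bandwidth arguments correspond to the two smoothing parameters mentioned above. In this sketch they are set arbitrarily, whereas in practice they would have to be chosen from the data.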

The starting point for covariate selection should be subject matter knowledge, but since this typically gives only partial guidance, data-driven methods are used to reduce the set of potential confounders to a smaller, more relevant set of covariates.

The propensity score is commonly estimated by logistic regression, that is, one assumes a parametric functional form for the propensity score. In our effort to loosen the assumptions as much as possible we would prefer a non- or semiparametric estimator.
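For reference, the parametric baseline mentioned above can be sketched as follows. This is a minimal illustration in Python of logistic regression fitted by Newton-Raphson iterations, not the non- or semiparametric estimator the project aims for. The function name fit_logistic_score, the fixed number of iterations and the simulated data are assumptions made for the example.

```python
import numpy as np


def fit_logistic_score(x, t, n_iter=25):
    """Estimate P(T = 1 | X) by logistic regression (Newton-Raphson)."""
    design = np.column_stack([np.ones(len(x)), x])  # intercept + covariates
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (t - p)
        # Negative Hessian of the log-likelihood: X' diag(p(1-p)) X.
        information = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(information, gradient)
    scores = 1.0 / (1.0 + np.exp(-design @ beta))
    return beta, scores


# Simulated data (an assumption for illustration): the logistic model
# is correctly specified with coefficients (0, 0.5, -0.5).
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))
true_beta = np.array([0.0, 0.5, -0.5])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + x @ true_beta[1:])))
t = rng.binomial(1, p_true)

beta_hat, score_hat = fit_logistic_score(x, t)
print("estimated coefficients:", np.round(beta_hat, 2))
```

The estimated scores are only as good as the assumed functional form. If the true propensity score is not of logistic-linear form in the included covariates, the fitted values can be systematically off, which is the motivation for looking at less restrictive estimators.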

The properties of the chosen smoothing parameter selector and the resulting ACE estimator will be studied, both theoretically and through simulations. More precisely, we want to propose and study a criterion for the simultaneous choice of confounders and smoothing parameters. A major aim here is to make the methods developed and studied in this project available to practitioners by providing user-friendly software within the R project, i.e., freely available on the internet.