Data subsampling, active learning, and optimal design
Tuesday 26 April, 2022 at 13:15 - 14:00
MIT.A.356 and Zoom
Abstract: Data subsampling is increasingly used in the statistics and machine learning communities to overcome practical, economic and computational bottlenecks in modern large-scale inference problems. Examples include leverage sampling for big data linear regression, optimal subdata selection for generalised linear models, and active machine learning in measurement-constrained supervised learning problems.
So far, contributions to the field have largely focused on computationally efficient algorithmic developments. Consequently, most sampling schemes proposed in the literature are either based on heuristic arguments or use optimality criteria with known deficiencies, such as dependence on the scaling of the data and on the parametrisation of the model. We develop a general theory of optimal design for data subsampling methods and derive a class of asymptotically linear optimality criteria that i) can easily be tailored to the problem at hand, ii) are invariant to the parametrisation of the model, and iii) enable fast and efficient computation for both Poisson and multinomial sampling designs.
The methodology is illustrated on binary classification problems in active machine learning, and on density estimation in computationally demanding virtual simulations for safety assessment of automated vehicles.
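For readers unfamiliar with the two sampling designs named in the abstract, the following is a minimal sketch of Poisson and multinomial subsampling, using leverage scores for linear regression as the sampling criterion. It illustrates the existing leverage-sampling approach mentioned as an example, not the optimality criteria developed in the talk. The synthetic data, the function names and the subsample size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): linear model with Gaussian noise.
N, d, n = 10_000, 3, 500  # full-data size, number of features, target subsample size
X = rng.normal(size=(N, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=N)

# Leverage scores: diagonal of the hat matrix, obtained from a thin QR factorisation.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)  # these sum to d


def poisson_subsample(scores, n, rng):
    """Include each unit independently with probability min(1, n * score / sum)."""
    pi = np.minimum(1.0, n * scores / scores.sum())
    idx = np.flatnonzero(rng.random(scores.size) < pi)
    return idx, 1.0 / pi[idx]  # indices and inverse-probability weights


def multinomial_subsample(scores, n, rng):
    """Draw exactly n units with replacement, with probabilities proportional to scores."""
    p = scores / scores.sum()
    idx = rng.choice(scores.size, size=n, replace=True, p=p)
    return idx, 1.0 / (n * p[idx])  # indices and inverse-probability weights


def weighted_least_squares(X, y, w):
    """Solve the weighted least-squares problem by rescaling rows with sqrt(w)."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


idx_p, w_p = poisson_subsample(leverage, n, rng)
idx_m, w_m = multinomial_subsample(leverage, n, rng)
beta_poisson = weighted_least_squares(X[idx_p], y[idx_p], w_p)
beta_multinomial = weighted_least_squares(X[idx_m], y[idx_m], w_m)
```

Under Poisson sampling the subsample size is random with expectation n, whereas multinomial sampling returns exactly n draws and may select the same unit more than once. In both cases the inverse-probability weights correct for the unequal selection probabilities.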