When using machine learning algorithms, it is often assumed that the data is complete. In real-life applications, however, this assumption is usually over-optimistic. “Missingness” can happen in many ways: some missing covariates, some missing responses, only a lower bound is given for the response (i.e., the response is right censored), observations are seen only if they crossed some level (i.e., left truncation), or a label is given only to a bag of observations. We develop machine learning tools that can handle missing data, using imputation, inverse probability weighting, and doubly-robust estimators.
Data scientists are interested in answering questions such as how confident one is in a prediction, and whether a certain feature has a significant influence on the response variable. Drawing statistical inference for machine learning algorithms is difficult. We study methods for performing statistical inference for two common machine learning techniques: kernel machines and deep learning. We utilize Bayesian methods to quantify uncertainty, select hyper-parameter values, and to bound the generalization error. We propose novel PAC-Bayes generalization bounds which can be data-dependent.
To help policymakers set policy based on scientific methods, we use mathematical modeling and advanced statistical tools to study different aspects of the COVID-19 pandemic. Our research includes learning the susceptibility and infectivity of children and adolescents; the protection of vaccination and previous SARS-CoV-2 infection in preventing subsequent SARS-CoV-2 infection and other COVID-19 outcomes; and the effect of COVID-19 on different aspects of public health, such as suicide rate and natural abortion.