This project seeks to elucidate the mechanisms of information storage and processing in machine learning systems for human language, by (a) measuring localization and distributivity of information in complex models; (b) discovering causal relationships between model components and automatic (potentially biased) decisions; and (c) making language processing systems more interpretable and controllable. The research is expected to promote responsible and accountable adoption of language technology.
Despite the empirical success of deep learning models in natural language processing (NLP), these models face two challenges: they are opaque and difficult to interpret, and they are fragile and not robust to shifts in the data distribution. This project studies the relationship between interpretability and robustness in NLP: are more robust models also more interpretable, and vice versa? This research is expected to facilitate the development of models that are more trustworthy, fair, and reliable.