
Abhinav S Menon

Abhinav S Menon, supervised by Dr. Manish Shrivastava, received his Master of Science – Dual Degree in Computational Linguistics (CLD). Here’s a summary of his research work on Formal Languages for Mechanistic Interpretability:

Neural models have seen exponential changes, both in scale and in deployment, in the years since transformers and large language models were developed. The scale of these models and of their training data has enabled them to reach near-human (and in some cases superhuman) performance on several tasks. However, this raises concerns of value misalignment and potential misbehaviour of these models in high-stakes situations. This creates the need for a more fine-grained, general, and mathematical understanding of how these models function, with the objective of reliably and generally predicting and controlling their behaviour. This is the central effort of interpretability, a field of study that aims to reduce the heavily overparameterized functions implemented by neural nets to simple, sparse, and abstract causal models. However, the relative immaturity of the discipline means that its paradigms, techniques, and experiments have yet to reach a consensus standard of rigour.

In this thesis, we present a proof of concept that analogy with the natural sciences can form a valuable foundation for achieving the long-term aims of interpretability; in particular, we leverage the reductionist approach to understanding complex systems and apply it to the study of deep models. We restrict our scope to models that operate on natural language (or, more generally, text) rather than other modalities such as images, audio, or time series. We therefore take inspiration from computational linguistics, which in its incipient phases relied on a remarkably expressive reduction of natural language: formal grammars. We exploit this concept to idealize the conditions under which we examine neural language models, and present a study that operationalizes this intuition.

Concretely, we examine the recently popular sparse autoencoder (SAE) method for interpretability. This method centres on two-layer MLPs with a sparse, overcomplete hidden representation, trained to encode a latent space of a large model, in the hope that a meaningful semantic decomposition of this space arises. We use language models trained on formal grammars, attempt to uncover relevant features with this approach, and try to identify properties of the approach that bear on its usability. Our findings largely align with existing conclusions on the properties of SAEs (although these were based mostly on experiments in the image domain), such as their sensitivity to inductive biases and lack of robustness. Most significantly, we note that the features identified by SAEs are rarely causally relevant: ablating them fails to produce the expected effects most of the time. As causality has emerged as a widely agreed upon sine qua non among interpretability researchers, this is a major deficiency of the method. We propose, accordingly, a modification of the pipeline that aims to incentivize the causality of identified features, and demonstrate its efficacy in the same setting of formal grammars.

Overall, we believe that our results demonstrate the potential of importing scientific modi operandi into interpretability, and more specifically, the capacity of reductionism to provide useful insights into the functioning of deep models.
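
To make the SAE setup concrete, the sketch below is a minimal illustration in PyTorch, not the thesis's actual implementation: the dimensions, L1 coefficient, and ablated feature index are all hypothetical. It shows a two-layer MLP with an overcomplete hidden code trained to reconstruct cached language-model activations under a sparsity penalty, followed by the kind of feature ablation used to test causal relevance.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Two-layer MLP: the encoder maps d_model -> d_hidden (overcomplete, d_hidden > d_model),
        # the decoder maps back; an L1 penalty on the hidden code encourages sparsity.
        def __init__(self, d_model=128, d_hidden=1024):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            z = torch.relu(self.encoder(x))   # sparse, overcomplete code ("features")
            x_hat = self.decoder(z)           # reconstruction of the LM activation
            return x_hat, z

    def sae_loss(x, x_hat, z, l1_coeff=1e-3):
        # Reconstruction error plus a sparsity penalty on the hidden code.
        return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()

    # One training step on a batch of cached language-model activations (shapes are illustrative).
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    acts = torch.randn(256, 128)              # stand-in for activations from the LM
    x_hat, z = sae(acts)
    loss = sae_loss(acts, x_hat, z)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Ablating a feature: zero one hidden unit and decode, to test whether that
    # feature is causally relevant to the model's downstream behaviour.
    with torch.no_grad():
        _, z = sae(acts)
        z[:, 42] = 0.0                        # hypothetical feature index
        patched_acts = sae.decoder(z)         # would be patched back into the LM

In the setting described above, such patched activations would be written back into a language model trained on a formal grammar, to check whether ablating the feature produces the expected change in the model's behaviour.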

June 2025