Malladi Bhaskararama Sahishna Advaith, supervised by Dr. Radhika Mamidi, received his Master of Science – Dual Degree in Computational Linguistics (CLD). Here’s a summary of his research work on Oversight Techniques for Trustworthy AI: Explaining Differences between Models and Detecting Unfaithful Generations:
The rapid evolution of machine learning from task-specific systems to general-purpose decision-makers has amplified concerns around transparency, reliability, and trust. Despite strong performance, modern models remain opaque, often exhibiting unpredictable behavior and generating outputs that are fluent yet unfaithful to the input or task constraints. These challenges necessitate principled oversight mechanisms that can both explain model behavior and detect failures in a scalable and reliable manner. This thesis addresses these challenges through the lens of behavioral faithfulness, focusing on two complementary problems: (1) generating faithful explanations of model behavior, and (2) detecting unfaithful model outputs. We introduce SLED (Sample Learning to Explain Divergence), a framework for producing global natural language explanations that characterize where two machine learning models converge and diverge in their predictions. SLED synthesizes representative inputs via gradient-based optimization to identify regions of agreement and disagreement, and leverages large language models to generate interpretable explanations grounded in these behavioral patterns. Unlike prior approaches based on local feature attributions or training data access, SLED provides model-level insights with minimal data requirements. Experiments across 13 datasets and 11 model configurations show that SLED consistently outperforms baselines such as MaNtLE, LIME, and Anchors, achieving improvements of 18–24% in faithfulness and 10–22% in simulatability. Ablation studies demonstrate that the framework is sample-efficient, robust to the choice of explainer models, and benefits from regularization that ensures realistic synthetic samples. Human evaluations further validate its utility, with users achieving a simulatability score of 63.5% and reporting improved understanding of model behavior.
In parallel, this thesis formalizes hallucinations as failures of faithfulness, enabling systematic analysis across tasks. We propose a computationally efficient, NLI-based approach for detecting hallucinations in natural language generation tasks such as definition modeling, machine translation, and paraphrase generation. This framework models hallucination detection as a function of both model inputs and training distributions, allowing it to generalize across domains. Overall, this work establishes faithfulness as a unifying principle for trustworthy AI. By grounding explanations in observable behavior and reframing hallucinations as measurable failures of that behavior, it bridges the gap between model decision processes and human understanding, contributing toward more transparent and reliable machine learning systems.
May 2026

