Manav Chaudhary, supervised by Prof. Vasudeva Varma Kalidindi, received his Master of Science – Dual Degree in Computer Science and Engineering (Lateral Entry, LCD). Here’s a summary of his research work on You Can (Not) Trust: Reliability and Robustness of LLMs as Human-Like Annotators and Judges:
The rapid adoption of Large Language Models (LLMs) as tools for annotation and evaluation tasks in Natural Language Processing (NLP) has led to important questions about their reliability, robustness, and alignment with human expectations. This thesis investigates these concerns across three connected lines of inquiry: (1) the indistinguishability of LLM-generated annotations from human annotations, (2) human alignment and reliability of LLMs as subjective judges of language quality, and (3) the susceptibility of LLM-judges to misinformation attacks in evaluation settings.
In the first part of this thesis, we study whether annotations made by LLMs can be reliably distinguished from those created by humans. Previous research has claimed that LLMs behave as ‘human-like annotators’, motivating our work to rigorously test this hypothesis. We frame this as a classification task: given a dataset of annotations, can a model detect their origin? Surprisingly, our findings indicate that even state-of-the-art classifiers achieve near-chance performance, with accuracy not exceeding 51%. These results offer strong empirical evidence that LLM-produced annotations are indistinguishable from those of human annotators in most settings, validating the ‘human-like’ claim from a discriminative modeling point of view.
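The origin-detection setup described above can be sketched with a simple text classifier: train on annotations labeled by source (human vs. LLM) and measure how well the source is recovered. The toy Naive Bayes classifier and data below are illustrative assumptions, not the thesis’s actual models or dataset; the point is that near-chance accuracy would indicate the two sources are indistinguishable to the classifier.

```python
from collections import Counter
import math

def train_nb(texts, labels):
    """Fit unigram counts and class priors (0 = human, 1 = LLM)."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for text, y in zip(texts, labels):
        counts[y].update(text.lower().split())
    return counts, priors

def predict_nb(counts, priors, text):
    """Return the class with the higher add-one-smoothed log score."""
    vocab = set(counts[0]) | set(counts[1])
    best, best_score = None, -math.inf
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(priors[y] / sum(priors.values()))
        for w in text.lower().split():
            score += math.log((counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best

# Toy annotations, purely illustrative -- not the thesis data.
texts = [
    "the sentiment here is clearly positive",
    "label: positive, the reviewer praises the product",
    "negative sentiment, the user complains about quality",
    "overall the passage expresses a negative opinion",
]
labels = [0, 1, 0, 1]

counts, priors = train_nb(texts, labels)
preds = [predict_nb(counts, priors, t) for t in texts]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In the thesis’s framing, a stronger classifier in the same role still lands near 0.5 accuracy on balanced data, which is what supports the indistinguishability claim.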
Building on this, we explore whether LLMs can function as reliable evaluators: a more nuanced role that goes beyond labeling and involves subjective scoring of text quality (e.g., summaries, open-ended responses). In this phase, we test how well LLM-generated judgments align with human evaluations and how robust they are to subtle changes in prompt phrasing. We introduce controlled prompt perturbations and evaluate the consistency of LLM scores and textual justifications. The results show a significant lack of robustness: small, sometimes misleading, perturbations lead to large changes in judgment outputs. These inconsistencies expose limitations in using LLMs as trustworthy evaluators, especially in high-stakes or subjective settings.
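A perturbation-consistency check of this kind can be sketched as follows: render several paraphrased judging prompts for the same candidate text, query the judge with each, and measure the spread of the returned scores. The prompt variants and the `query_judge` stub below are placeholders of my own, not the thesis’s actual prompts or model interface.

```python
import statistics

# Paraphrased variants of one judging instruction (illustrative).
PROMPT_VARIANTS = [
    "Rate the summary below from 1 to 5 for faithfulness:\n{text}",
    "On a 1-5 scale, how faithful is this summary?\n{text}",
    "Score the following summary's faithfulness (1-5):\n{text}",
]

def query_judge(prompt):
    """Placeholder for a real LLM call; returns a fixed score here."""
    return 4

def score_consistency(text):
    """Query the judge under each variant and summarize the spread."""
    scores = [query_judge(v.format(text=text)) for v in PROMPT_VARIANTS]
    spread = max(scores) - min(scores)
    return scores, spread, statistics.pstdev(scores)

scores, spread, sd = score_consistency("The report covers Q3 results.")
```

With a real judge in place of the stub, a large `spread` or standard deviation across semantically equivalent prompts is exactly the lack of robustness the thesis reports.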
The final part of this thesis presents a preliminary exploration into the robustness of LLM-based evaluators when exposed to misinformation, marking a conceptual shift from controlled perturbations to real-world-inspired adversarial scenarios. We introduce a novel framework that systematically injects misinformation into prompts, grounded in a taxonomy of ten misinformation types. This study probes how these manipulations affect LLM judgments, analyzing both score alignment and justification consistency.
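The injection framework can be sketched as a harness that prepends a false claim of a given type to an otherwise unchanged evaluation prompt, so that baseline and perturbed judgments can be compared. The three type names and claim texts below are examples of my own invention, not the thesis’s actual ten-type taxonomy.

```python
# Illustrative misinformation snippets keyed by type (hypothetical
# names; the thesis defines its own taxonomy of ten types).
INJECTIONS = {
    "false_context": "Note: this summary was written by a domain expert.",
    "fabricated_fact": "Background: the source article has been retracted.",
    "misattribution": "This text is an official statement from the authors.",
}

def inject(prompt, kind):
    """Prepend the misinformation snippet of the given type."""
    return INJECTIONS[kind] + "\n\n" + prompt

baseline = "Rate this summary from 1 to 5:\n{summary}"
perturbed = {kind: inject(baseline, kind) for kind in INJECTIONS}
```

Keeping the underlying prompt identical across conditions is what lets the analysis attribute any change in scores or justifications to the injected misinformation alone.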
Contrary to our initial expectations, the results suggest that LLM-judges demonstrate a surprising degree of resilience. Scores and justifications remain largely consistent across many misinformation types, even when factual content is altered. However, this robustness appears to be superficial: the judges often fail to explicitly recognize, or reflect awareness of, being misled in their justifications. Rather than detecting and correcting misinformation, they tend to preserve their prior judgments, raising concerns about a form of robustness that lacks epistemic awareness.
These early findings open new avenues for future research. Comparative assessment, misinformation detection, and deeper semantic analysis may reveal subtler vulnerabilities. More broadly, this work lays the groundwork for building misinformation-aware evaluation systems and motivates the development of LLMs that not only evaluate well but do so with fact-sensitive reasoning.
Collectively, this thesis explores the evolving role of LLMs as annotation and evaluation agents. While they show promise as ‘human-like’ annotators, their behavior as evaluators under adversarial and misinformation-rich contexts highlights both strengths and blind spots. By offering early insights into their robustness and limitations, we provide a foundation for future work aimed at building trustworthy, interpretable, and context-aware LLM judges.
July 2025

