Chowdam Venkata Thirumala Kumar, supervised by Dr. Chiranjeevi Yarra, received his Master of Science in Electronics and Communication Engineering (ECE). Here is a summary of his research work, "Towards an End-to-End Spoken Grammatical Error Detection Under Three Speaking Practice Conditions":
Language learning technologies such as computer-assisted language learning (CALL) systems have gained popularity in recent years, driven by the increasing demand for second language acquisition. In particular, the number of second language learners of English is growing significantly [1], as English is the primary medium for education and communication globally [2]. Second language acquisition involves learning different aspects of the desired language, such as (1) comprehension, which involves mastering vocabulary, grammar, etc., and (2) pronunciation, which focuses on accurately articulating words. Proficiency in both areas is pivotal for effective communication: comprehension aids in constructing grammatically correct sentences, while pronunciation facilitates clear and precise articulation. Spoken grammar refers to the grammatical structure of spoken language; it is essential for clear and effective oral communication, and is therefore an important skill in language acquisition.

Despite the importance of learning spoken grammar, most existing CALL systems focus primarily on text-based lessons and practice questions, as textual materials are easier to create and evaluate. Consequently, substantial progress has been made in grammatical error detection and correction for text [3], with many tools available, such as Grammarly [4] and Write & Improve. In contrast, there is limited work on automatic grammar evaluation methods for speech. This is because spoken sentences require multiple stages of processing, including (1) decoding the spoken sentence to text and (2) detecting or correcting the grammatical errors in the decoded text. Errors made during the decoding process can propagate to subsequent steps, resulting in incorrect output. Furthermore, the lack of publicly available datasets makes the task even more challenging.
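The error-propagation problem described above can be made concrete with a toy example (not from the thesis): a hypothetical ASR whose language model prefers fluent word sequences silently "repairs" the learner's grammatical error, so a downstream text-based detector sees nothing. All functions and the tiny rule set here are illustrative assumptions.

```python
# Toy illustration of ASR error propagation in a cascaded pipeline:
# the ASR language model "repairs" the learner's grammatical error,
# so the downstream text-based detector never sees it.

def asr_with_lm_bias(spoken_words):
    """Hypothetical ASR whose language model prefers fluent bigrams."""
    lm_preferences = {("he", "go"): ("he", "goes")}  # LM silently fixes the form
    words = list(spoken_words)
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        if pair in lm_preferences:
            words[i], words[i + 1] = lm_preferences[pair]
    return words

def text_ged(words):
    """Hypothetical rule-based detector for subject-verb agreement."""
    errors = []
    for i in range(len(words) - 1):
        if words[i] in {"he", "she", "it"} and words[i + 1] == "go":
            errors.append(i + 1)  # flag the uninflected verb
    return errors

spoken = ["he", "go", "to", "school"]   # learner's actual utterance
print(text_ged(spoken))                 # → [1]: direct detection flags the verb
decoded = asr_with_lm_bias(spoken)      # cascaded path: ASR first
print(text_ged(decoded))                # → []: the error has been masked
```

The same masking effect occurs in practice whenever a strong language model inside the recogniser normalises ungrammatical learner speech toward fluent text.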
Spoken Grammatical Error Detection (SGED) involves identifying grammatical errors in spoken utterances. Traditionally, SGED has been implemented through a cascaded pipeline in which Automatic Speech Recognition (ASR) converts speech to text, followed by a text-based grammatical error detector (GED). However, this approach is limited by ASR error propagation: recognition errors and language-model biases can overwrite or mask the learner's actual grammatical mistakes, leading to suboptimal performance. In the literature, SGED and spoken grammatical error correction (SGEC) are typically approached through a cascaded system consisting of two primary components: (1) ASR, which converts speech to text, and (2) Grammatical Error Detection/Correction (GED/C), which identifies or corrects grammatical errors in the decoded text. If the speech contains disfluencies, such as filler words, non-verbal utterances, repetitions, or self-corrections, a Disfluency Detector (DD) is placed between the ASR and GED/C modules to remove them. The majority of existing work has focused on enhancing the text-based GED or GEC systems and/or the DD while freezing the ASR component [5, 6, 7, 8, 9]; these enhancements aim to make the GED or GEC models more robust to ASR output. Additionally, some approaches incorporate ASR confidence scores, either replacing or complementing the ASR-decoded text [10, 11, 12]. Despite these advancements, building end-to-end SGEC systems remains challenging due to the lack of publicly available training data. The only accessible dataset is the NICT-JLE corpus [13], which contains audio and transcripts of interviews with Japanese learners.
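The three-stage cascade described above (ASR, then disfluency removal, then GED) can be sketched as follows. All three components are toy stand-ins for the real models, and the rule sets are illustrative assumptions, not anything from the thesis.

```python
# Minimal sketch of the cascaded SGED pipeline:
# ASR -> Disfluency Detection (DD) -> Grammatical Error Detection (GED).

FILLERS = {"uh", "um", "er"}  # toy filler-word inventory

def asr(audio):
    """Stub ASR: here the 'audio' is already a token list."""
    return audio

def remove_disfluencies(tokens):
    """Stub DD: drop filler words and immediate self-repetitions."""
    out = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        if out and out[-1] == tok:   # repetition, e.g. "I I went"
            continue
        out.append(tok)
    return out

def ged(tokens):
    """Stub GED: per-token binary labels (1 = grammatical error)."""
    labels = [0] * len(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i] in {"he", "she", "it"} and tokens[i + 1] == "go":
            labels[i + 1] = 1        # subject-verb agreement error
    return labels

def cascaded_sged(audio):
    tokens = remove_disfluencies(asr(audio))
    return list(zip(tokens, ged(tokens)))

print(cascaded_sged(["uh", "he", "he", "go", "to", "school"]))
# → [('he', 0), ('go', 1), ('to', 0), ('school', 0)]
```

The structure makes the propagation risk explicit: any mistake in `asr` or `remove_disfluencies` changes the token sequence that `ged` labels.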
Although the corpus includes manual annotations for disfluencies and grammatical errors along with their corrections, only the annotated transcriptions have been released publicly. This limitation forces most research efforts to adopt a cascaded system design in which ASR outputs are fed into GED/C and DD modules [14]. In this work, we systematically study statistical and end-to-end ASR systems for SGED under three speaking practices commonly employed in CALL systems: memorised, semi-spontaneous, and spontaneous. In the memorised and semi-spontaneous conditions, the intended grammatically correct text is known in advance, allowing us to study how different ASR-derived cues behave when learners produce incorrect forms. In the statistical ASR setting, we analyse pronunciation- and alignment-based features such as Goodness of Pronunciation (GoP) scores and Forced Alignment (FA) likelihoods, using the known grammatically correct reference text to examine how these cues vary with grammatical deviations. In the end-to-end ASR setting, we similarly use Connectionist Temporal Classification (CTC) aligner probabilities computed against the ground-truth grammatically correct text to understand how alignment behaviour changes when the spoken utterance deviates from the expected grammatical form. For the spontaneous condition, where no reference text is available, most existing works rely on the traditional cascaded ASR-plus-GED approach operating on the ASR-decoded text; however, this design is constrained by ASR error propagation and language-model bias.
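To make the reference-text cues concrete, here is a hedged sketch of one common GoP formulation: for each canonical phone, average over its aligned frames the log-posterior of that phone minus the log-posterior of the best competing phone. In practice the frame posteriors come from an acoustic model and the segments from forced alignment against the reference text; here both are toy inputs, and the exact GoP variant used in the thesis may differ.

```python
import numpy as np

def gop(posteriors, segments):
    """Goodness of Pronunciation per canonical phone.

    posteriors: (T, P) array of frame-level phone posteriors.
    segments: list of (phone_index, start_frame, end_frame) from forced
    alignment against the grammatically correct reference text.
    """
    log_post = np.log(posteriors + 1e-10)
    scores = []
    for phone, start, end in segments:
        frames = log_post[start:end]                  # frames aligned to phone
        numer = frames[:, phone]                      # canonical phone
        denom = frames.max(axis=1)                    # best competing phone
        scores.append(float(np.mean(numer - denom)))  # <= 0; 0 is best match
    return scores

# Toy example: 4 frames, 3 phones. Phone 0 (frames 0-1) matches the
# acoustics well; phone 1 (frames 2-4) is a poor match, as happens when
# the spoken form deviates from the expected reference.
post = np.array([[0.9, 0.05, 0.05],
                 [0.8, 0.10, 0.10],
                 [0.2, 0.10, 0.70],
                 [0.1, 0.20, 0.70]])
print(gop(post, [(0, 0, 2), (1, 2, 4)]))  # → [0.0, -1.599...]
```

A sharply negative score for a phone of the reference word is exactly the kind of cue that signals the learner produced something other than the expected grammatical form.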
To overcome the inherent limitations of cascaded SGED, we propose a method that directly transforms raw speech into a textual embedding space, termed unified embeddings, and apply it to SGED in two settings: (1) an end-to-end approach that predicts grammaticality directly from speech, and (2) an end-to-end fusion-based approach that combines unified embeddings with ASR transcript embeddings to leverage both modalities jointly. The unified embeddings are obtained by distilling knowledge from a pre-trained text encoder into a speech encoder via speech-text unification. To the best of our knowledge, this is the first work to explore both end-to-end and fusion-based approaches for the SGED task. Experiments show that the proposed unification-based fusion approach outperforms both the cascaded baseline and the end-to-end SGED system, highlighting that unified embeddings provide effective complementary information for mitigating ASR errors in SGED.
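The fusion setting can be sketched at the embedding level. Everything here is an illustrative assumption (concatenation fusion, a linear probe, random vectors standing in for encoder outputs); the thesis architecture and dimensions are not specified in this summary.

```python
import numpy as np

# Illustrative late-fusion sketch: combine an utterance-level "unified"
# speech embedding with an ASR transcript embedding by concatenation,
# then score grammaticality with a linear binary classifier.

rng = np.random.default_rng(0)
D_SPEECH, D_TEXT = 16, 16

def fuse(speech_emb, text_emb):
    """Fuse the two modality embeddings by concatenation."""
    return np.concatenate([speech_emb, text_emb])

def classify(fused, weights, bias):
    """Linear probe: P(utterance contains a grammatical error)."""
    logit = fused @ weights + bias
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid

speech_emb = rng.standard_normal(D_SPEECH)  # stand-in: distilled speech encoder
text_emb = rng.standard_normal(D_TEXT)      # stand-in: ASR transcript encoder
weights = rng.standard_normal(D_SPEECH + D_TEXT)
p = classify(fuse(speech_emb, text_emb), weights, bias=0.0)
print(round(p, 3))                          # a probability in (0, 1)
```

The point of the fusion is that the speech-derived unified embedding remains informative even when the ASR transcript embedding encodes a misrecognition, which is how the combined representation can mitigate ASR errors.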
January 2026

