Utkarsh Mehrotra received his MS Dual Degree in Electronics and Communication Engineering (ECE). His research was supervised by Dr. Anil Kumar V. Here’s a summary of his research work on Feature-level Improvements for Detection of Multiple Speech Disfluencies in Indian English:
Disfluencies in speech are abrupt breaks or hesitations that disrupt the forward flow of speech. Their presence can adversely affect the performance of many speech-based applications, such as Automatic Speech Recognition (ASR) systems, so disfluencies need to be identified in the speech signal for these systems to perform effectively. This task is referred to as disfluency detection. The work in this thesis focuses on detecting multiple disfluencies in spontaneous lecture-mode speech, using the IIITH-Indian English Disfluency (IIITH-IED) dataset. The main focus of the thesis is to incorporate feature-level improvements into the disfluency detection pipeline so that the characteristics of disfluencies can be modelled and captured directly from speech.
To improve frame-level disfluency detection, Shifted Delta Cepstral (SDC) coefficients, which can effectively capture temporal variations in the speech signal, are explored. Different configurations of SDC features are then experimented with for each type of disfluency to study the effect of varying the SDC parameters. Since disfluencies also depend on speaking style and can vary from speaker to speaker, speaker characteristics are incorporated into the detection systems by providing d-vectors (speaker embeddings) as input along with the cepstral and SDC features. An improvement in detection accuracy is obtained for all the disfluencies, showing the importance of considering temporal variations and speaker characteristics when detecting disfluencies.
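To make the SDC idea concrete, here is a minimal sketch of the standard SDC computation in numpy. The parameters `d` (delta spread), `P` (shift between delta blocks) and `k` (number of stacked blocks) correspond to the configurable SDC parameters mentioned above; the function name and the simple edge handling are illustrative choices, not the thesis's exact implementation.

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstral features.

    cepstra: (T, N) array of frame-level cepstral coefficients.
    d: spread used for each delta computation.
    P: shift between consecutive delta blocks.
    k: number of delta blocks stacked per frame.
    Returns a (T, k*N) array; frames near the boundaries reuse the
    nearest valid frame (simple edge clamping, an assumption here).
    """
    T, N = cepstra.shape
    out = np.zeros((T, k * N))
    for t in range(T):
        for i in range(k):
            # i-th delta block: c(t + iP + d) - c(t + iP - d)
            hi = min(t + i * P + d, T - 1)
            lo = max(min(t + i * P - d, T - 1), 0)
            out[t, i * N:(i + 1) * N] = cepstra[hi] - cepstra[lo]
    return out
```

In the systems described above, these SDC vectors would be concatenated frame-wise with the base cepstral features (and a per-speaker d-vector) before being passed to the classifier.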
In the literature, most works use frame-level or utterance-level analysis for detecting disfluencies. To analyse speech disfluencies at a supra-segmental level, syllable-level disfluency detection is performed here. Pre-linguistic automatic syllabification is used to segment the input speech into perceptually distinct syllable-like units. Then, statistical prosody features are used to detect the presence of speech disfluencies in each syllable-like region. To add complementary acoustic information to the disfluency detection systems, a BiLSTM-based feature extractor is used to obtain an acoustic representation of the syllable-like units. This acoustic representation is concatenated with the prosody features, and the combined features achieve an accuracy of 91.24% in distinguishing disfluent from non-disfluent speech syllables.