Sushmita received her doctorate in Electronics and Communication Engineering (ECE). Her research work was supervised by Dr. V Suryakanth Gangashetty and co-supervised by Dr. Nilesh Madhi, Ghent University, Belgium. Here’s a summary of her research work on Robust Estimation of Direction of Arrival and Time-Frequency Masks for Speech Enhancement:
Nowadays, speech-enabled smart devices are used to improve human-machine interaction. One of the primary tasks in these devices is speech enhancement: the extraction of the desired speech signal from the noisy and reverberant signals recorded by the microphones. The noise could be background noise or interfering speakers. The performance of speech-enabled devices relies significantly on the performance of the speech enhancement methods. Speech enhancement is also a primary task in devices that improve the comfort of human-human interaction, such as hearing aids and hands-free communication devices.
Among several speech enhancement methods, beamformers and Time-Frequency (TF) mask-based methods are widely used. Beamformers are linear spatial filters that aim to boost the signal coming from a specific direction by appropriate configuration of the microphone array, and in doing so attenuate interfering signals from other directions. A source TF mask identifies the time-frequency regions where the source is dominant and can be applied to the mixture TF representation to extract the desired source. Recently, enhancement with mask-aided beamformers has become popular, where TF masks are used to estimate the Second-Order Statistics (SoS) required for computing the weights of the beamformers.
Several beamformers and a few TF mask estimators need the Direction of Arrival (DoA) of the sources, which must be estimated from the microphone data if it is not known a priori. DoA refers to the azimuth and elevation angles of arrival of the sound sources with respect to the microphone array axis. For example, designing the class of beamformers known as data-independent beamformers requires the DoA of the source to be enhanced. Also, the DoAs of the desired and undesired sources are required to set the constraints while designing data-dependent beamformers. In the literature, a few TF masks are estimated based on prior knowledge of the DoA of the sources.
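To make the dependence of a data-independent beamformer on the DoA concrete, here is a minimal sketch of a delay-and-sum beamformer at a single frequency. It is an illustration only and not a method from the thesis. It assumes a uniform linear array, a far-field plane wave, an azimuth measured from the array axis, and a speed of sound of 343 m/s; the function names `steering_vector`, `delay_and_sum_weights` and `response` are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_vector(doa_deg, n_mics, spacing_m, freq_hz):
    """Far-field steering vector of a uniform linear array (assumed geometry).

    doa_deg is the azimuth measured from the array axis.
    """
    theta = np.deg2rad(doa_deg)
    positions = np.arange(n_mics) * spacing_m
    # Relative propagation delay of the plane wave at each microphone.
    delays = positions * np.cos(theta) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def delay_and_sum_weights(look_doa_deg, n_mics, spacing_m, freq_hz):
    """Weights that align the channels for the look direction and average them."""
    return steering_vector(look_doa_deg, n_mics, spacing_m, freq_hz) / n_mics


def response(weights, doa_deg, spacing_m, freq_hz):
    """Magnitude of the beamformer output for a unit plane wave from doa_deg."""
    d = steering_vector(doa_deg, len(weights), spacing_m, freq_hz)
    # np.vdot conjugates its first argument, giving w^H d.
    return float(np.abs(np.vdot(weights, d)))


if __name__ == "__main__":
    w = delay_and_sum_weights(60.0, n_mics=4, spacing_m=0.05, freq_hz=2000.0)
    print("gain towards 60 deg: ", response(w, 60.0, 0.05, 2000.0))
    print("gain towards 150 deg:", response(w, 150.0, 0.05, 2000.0))
```

The weights depend only on the assumed look direction and the array geometry, so an error in the DoA estimate steers the beam away from the desired source.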
In all the enhancement methods that utilize the DoA information of the sources, the closer the DoA estimates are to the ground-truth DoAs, the better the enhancement. Similarly, for good quality of the enhanced speech in methods that use TF masks, which may be estimated with or without prior knowledge of the DoAs, it is necessary to estimate the masks as close as possible to the ideal masks. The ideal masks are computed from knowledge of the speech and noise signals present in the microphone recordings and are generally used to obtain an upper bound on various performance metrics. To summarize, several speech enhancement methods utilize the source DoAs and TF masks, and both must be estimated accurately for high quality of the enhanced signals.
The thesis addresses two problems: robust DoA and TF mask estimation from noisy and reverberant microphone signals, and the enhancement of a TF mask. Specifically, it proposes two robust DoA estimators and a novel TF mask interpolation technique to improve the TF masks. Further, the efficacy of eigenvalue features for robust TF mask estimation in a resource-constrained, neural network-based speech enhancement task is investigated.
The first DoA estimator is a Non-negative Matrix Factorization (NMF) weighted Steered Response Power (SRP) beamformer, abbreviated as SRP-NMF. Broadband SRP beamformers cannot perform multi-speaker DoA estimation in a single time frame, a drawback that SRP-NMF overcomes through NMF weighting. The weights are obtained by NMF of the mixture spectrogram and correspond to the NMF atoms of the underlying sources. SRP-NMF exploits time-atom sparsity, i.e., in any one time frame only a few atoms are active and each active atom belongs to only one speaker. Weighting with different atoms in a time frame therefore allows for multi-speaker DoA estimation.
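As a rough illustration of what "NMF of the mixture spectrogram" produces, the sketch below factorizes a non-negative matrix V into spectral atoms W and their per-frame activations H using the standard multiplicative updates for the Euclidean cost. This is generic NMF, not the SRP-NMF estimator itself; the function name `nmf`, the number of atoms and the toy data are assumptions made for the example.

```python
import numpy as np


def nmf(V, n_atoms, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix V (freq x frames) as V ~ W @ H.

    W holds the spectral atoms (freq x atoms) and H their activations
    (atoms x frames). Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


if __name__ == "__main__":
    # Toy "spectrogram" built from two hypothetical atoms.
    rng = np.random.default_rng(1)
    V = rng.random((16, 2)) @ rng.random((2, 30))
    W, H = nmf(V, n_atoms=2)
    print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

Each column of H indicates which atoms are active in that time frame, which is the quantity the time-atom sparsity argument refers to.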
In evaluations on data from public challenges and on data generated from recorded room impulse responses, with various microphone array configurations, the SRP-NMF method outperforms the widely used variants of narrowband and broadband DoA estimators in terms of source detection capability and DoA estimation accuracy.
The second DoA estimator, SFF-PHAT-env, estimates the directional information of the sources by PHAse Transform (PHAT) weighted cross-correlation, across the channels, of the amplitude envelopes at several frequencies. The envelopes are obtained by passing the microphone signals through a narrowband filter called Single Frequency Filtering (SFF). SFF-PHAT-env is proposed to improve the performance of an existing SFF-based DoA estimator, in which the correlation is performed without PHAT weighting. The high signal-to-noise ratio regions in the envelopes, the PHAT weighting, and the multiple pieces of evidence available at several frequencies together result in robust DoA estimates. The performance of SFF-PHAT-env is compared with the other existing SFF-based methods and the state-of-the-art Generalized Cross Correlation (GCC)-based methods. The tests are conducted on publicly available data collected in real rooms in challenging conditions such as high reverberation and multiple speakers. For rigorous evaluation, the data is further corrupted by different types of noise. The experiments show that the best performing SFF-based methods are better than or comparable to the best GCC-based estimator in detection metrics such as F-measure and in accuracy metrics based on azimuth error.
Irrespective of whether the TF mask is obtained as an ideal binary mask (IBM) or by any practical method, there are often regions of the target speech that are suppressed, either because the ratio of signal energy to interference energy fell below the set threshold for the IBM or due to errors in practical approaches.
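The thresholding behind an ideal binary mask can be sketched in a few lines. The example below assumes that the magnitude spectrograms of the clean speech and of the interference are available separately (which is what makes the mask "ideal") and that the threshold is expressed in dB on the local signal-to-interference ratio; the function name `ideal_binary_mask` and the toy values are hypothetical.

```python
import numpy as np


def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0, eps=1e-12):
    """Return 1 where the local speech-to-interference ratio exceeds the threshold.

    speech_mag and noise_mag are magnitude spectrograms of the same shape.
    """
    local_ratio_db = 10.0 * np.log10((speech_mag**2 + eps) / (noise_mag**2 + eps))
    return (local_ratio_db > threshold_db).astype(float)


if __name__ == "__main__":
    speech = np.array([[2.0, 0.1],
                       [0.5, 3.0]])
    noise = np.ones_like(speech)
    print(ideal_binary_mask(speech, noise, threshold_db=0.0))
```

In this toy case the cell with speech magnitude 0.5 is set to zero at a 0 dB threshold even though it contains target energy, so that part of the speech is removed along with the interference.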
The suppressed target speech leads to ‘holes’ in the reconstructed signal that can produce audible artefacts. The influence of these errors can be reduced by estimating the missing data points through some form of interpolation. The thesis focuses on NMF as one such tool for missing data interpolation. The existing NMF-based interpolation methods are computationally intensive and do not offer a means to control the degree of interpolation, which results in over-estimation of the missing regions and leads to noise-vocoded output. The proposed NMF-based interpolation method addresses these drawbacks. This work considers the improvement achievable by applying the proposed method to ideal binary mask-based gain functions. The instrumental quality metrics, Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Noise Ratio (SNR), indicate the added benefit of the missing data interpolation compared to the output of the ideal binary mask.
In resource-constrained devices, the learning model cannot be complex. Hence, the burden of performing the task falls on the input features: the more discriminative the features, the better the performance. Eigenvalues are spectral features that can discriminate coherent sound sources from spatially uncorrelated ones, as empirically verified in an existing speech enhancement method. However, to the best of our knowledge, eigenvalues have not been used for neural network-based enhancement. In this thesis, for extracting speech from noise, we explore the efficacy of the instantaneous generalized eigenvalue features for neural network-based TF mask estimation. These features are compared with the commonly used spectral features: the magnitude spectrogram and the norm of the magnitude spectrograms across the microphones. Tests are conducted in both matched and unmatched noise conditions. Eigenvalue features show better improvements in objective scores that measure the quality and intelligibility of speech signals.