Suba S, supervised by Dr. Nita Parekh, received her doctorate in Computer Science and Engineering (CSE). Here’s a summary of her research work, titled “From Pixels to Prognosis: Machine Learning Solutions for Critical Healthcare Challenges”:
This work applies machine learning methods to two important problems: detection of COVID-19 from chest X-rays (CXRs) and CT scans, and molecular subtyping of breast cancer using multi-omics data. The recent pandemic made clear the need for fast and reliable techniques for distinguishing pneumonia caused by the novel virus SARS-CoV-2 from pneumonia caused by other viral, bacterial, or fungal infections. In this work, a basic CNN model was built from scratch and a spatial attention mechanism was incorporated into it (Attn-CNN) to detect the manifestations of COVID-19 in CXR and CT scan images with improved generalizability and explainability. The proposed spatial attention-based solution removes the need for lung segmentation and region-based annotations when training the CNN models while keeping model complexity low, thus making it deployable in clinical settings. To verify the generalizability of the models, testing was also carried out on external datasets, and explainability was provided using Grad-CAM visualization of the pixels selected by the model for classification. In a performance evaluation against five state-of-the-art deep learning models, the proposed approach achieved 95% accuracy for CXRs and 96% for CT images, outperformed all the other models, and generalized comparatively well on external datasets.
Advances in high-throughput techniques have generated large volumes of data, enabling genome-wide profiling of various omics data, such as protein-coding and non-coding (e.g., miRNA, lncRNA) genes, DNA methylation, and genetic variations (SNVs, CNVs, etc.). However, identification of diagnostic and prognostic biomarkers is challenging due to heterogeneity at multiple levels and the huge number of features associated with each level. This heterogeneity is seen to affect the generalizability and explainability of ML models.
To address the high dimensionality and explainability issues, a knowledge-based feature selection framework, combined with a filtering approach using predominant correlations, is proposed for multi-omics-based biomarker identification. Since breast cancer is a hormone-dependent cancer, we considered the molecular subtype classification based on three receptors, viz., estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2): Luminal (ER+, PR±, HER2±), HER2-enriched (ER–, PR–, HER2+), and Triple Negative (ER–, PR–, HER2–). DNA methylation data from protein-coding genes and long noncoding RNAs (lncRNAs) were integrated with gene expression data of the associated genes and copy number variant genes for feature selection and classification. Using 172 features obtained from the proposed framework, stratified 5-fold cross-validation was carried out with five ML models. The best performance was obtained with the Random Forest model, with an accuracy of 98.19% and AUC values ≥ 0.98 for all three classes, showing the effectiveness of the proposed approach.
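For readers unfamiliar with the evaluation protocol mentioned above, the following is a minimal sketch of how stratified 5-fold splitting keeps the subtype proportions the same in every fold. It is an illustration only, not the thesis code: the function name `stratified_k_fold` and the toy class counts (60/20/20) are assumptions made for this example, and in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used together with the classifier.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds preserve class ratios.

    Hypothetical helper written for illustration; not from the thesis.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal the indices round-robin so that
    # each fold's count for a given class differs by at most one.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels with made-up class counts (NOT the thesis dataset).
labels = ["Luminal"] * 60 + ["HER2-enriched"] * 20 + ["Triple Negative"] * 20
splits = list(stratified_k_fold(labels, k=5))

for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    counts = Counter(labels[i] for i in test_idx)
    print(f"fold {fold_no}: train={len(train_idx)} test={dict(counts)}")
```

With these toy counts every test fold holds 12 Luminal, 4 HER2-enriched, and 4 Triple Negative samples, so a per-fold accuracy or AUC is not skewed by one subtype being over- or under-represented in a particular fold.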
January 2025