
Swarnim S – Malaria Parasite Life Cycle

Swarnim Shukla received his Master of Science – Dual Degree in Computational Natural Sciences (CNS). His research work was supervised by Dr. Bhaswar Ghosh. Here’s a summary of his research work on Multi-class Classification of Malaria Parasite Life Cycle using Single-cell Transcriptomes:

Malaria, which is spread by the female Anopheles mosquito, is a highly fatal disease that affects many parts of the world, with up to 0.4 million deaths reported worldwide. The detection of malaria infection levels is based on the expression of vital genes. Experts quantify malaria parasite-infected red blood cells (RBCs) and classify their life cycle stages at the macroscopic level in order to make informed decisions. Several computational approaches have recently been proposed to avoid the dimensionality problem and produce accurate predictions. Our study presents a theoretical framework for selecting diagnostic markers and drug targets by applying machine learning (ML) techniques to single-cell RNA sequencing (scRNA-seq) data. The main objective is to select the top-ranked genes from the scRNA-seq profiles at different stages of the Plasmodium falciparum (Pf) life cycle inside infected RBCs. We employ a supervised learning algorithm coupled with feature selection algorithms to extract the genes most relevant for predicting the life cycle stages of Pf inside RBCs. The first stage of modelling improves the quality of the dataset (5066 features) by removing irrelevant features. Genetic Algorithm (GA)-based search is widely used for feature selection and for dealing with high-dimensional datasets; here, a GA-based dimensionality reduction technique is applied to the single-cell transcriptomic data to obtain an optimised subset of features from the larger dataset. This reduced subset (378 features) is then used in the second stage for high-accuracy multi-class classification. To separately transform the selected elements into a lower dimension, features are chosen based on their class variants, taking into account increased efficiency and accuracy.
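
To make the idea of GA-based feature selection concrete, here is a minimal sketch in plain Python. It is not the pipeline used in the thesis: the synthetic data, the nearest-centroid fitness function, the per-feature penalty, and every parameter value (population size, mutation rate, and so on) are assumptions chosen only for illustration. Each candidate solution is a boolean mask over features, and the GA evolves masks that classify a validation split well while keeping few features.

```python
import random

def make_toy_data(n_per_class=30, n_features=20, n_classes=4, seed=0):
    """Synthetic stand-in for an expression matrix (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for label in range(n_classes):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[label] += 3.0  # feature `label` acts as a marker for this class
            data.append((row, label))
    rng.shuffle(data)
    return data

def centroid_accuracy(mask, train, valid):
    """Accuracy of a nearest-centroid classifier restricted to masked features."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    sums, counts = {}, {}
    for row, label in train:
        s = sums.setdefault(label, [0.0] * len(idx))
        for k, j in enumerate(idx):
            s[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    cents = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    correct = 0
    for row, label in valid:
        pred = min(cents, key=lambda c: sum((row[j] - cents[c][k]) ** 2
                                            for k, j in enumerate(idx)))
        correct += (pred == label)
    return correct / len(valid)

def ga_select(train, valid, n_features, pop_size=30, generations=25,
              mut_rate=0.05, penalty=0.01, seed=1):
    """Evolve boolean feature masks; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    cache = {}

    def fitness(mask):
        # Reward validation accuracy, penalise large feature subsets.
        if mask not in cache:
            cache[mask] = centroid_accuracy(mask, train, valid) - penalty * sum(mask)
        return cache[mask]

    pop = [tuple(rng.random() < 0.5 for _ in range(n_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        next_pop = ranked[:2]  # elitism: carry the two best masks forward
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(ranked, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(ranked, 3), key=fitness)
            cut = rng.randrange(1, n_features)            # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = tuple((not g) if rng.random() < mut_rate else g
                          for g in child)                 # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

data = make_toy_data()
train, valid = data[:80], data[80:]
best = ga_select(train, valid, n_features=20)
print("selected features:", [j for j, keep in enumerate(best) if keep])
print("validation accuracy:", centroid_accuracy(best, train, valid))
```

Because the validation split steers the search, a real analysis would report final accuracy on a separate held-out test set that the GA never sees.
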
We constructed the protein-protein interaction network (PPIN) of these genes and performed topological analysis using the Search Tool for the Retrieval of Interacting Genes/Proteins database (STRING 11.0b) and Gephi software to rank the genes by their importance in the network. Various topological measures were estimated to evaluate the node characteristics in the PPIN, including degree, betweenness centrality, eccentricity, closeness centrality, eigenvector centrality, and clustering coefficient. Proteins with high degree and betweenness centrality tend to exert more control over network function. We also performed gene ontology analysis to determine the role of proteins in the parasite’s life cycle progression. For the multi-class classification of the malaria parasite life cycle based on oriented gradients and local binary pattern features, a three-pronged approach employing multi-class Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF) techniques is used. Using these 378 features, RF performed best with a classification accuracy of 92%, while SVM reached 91% and LR 88%. By merely using the 378 features, we achieved similar or better performance scores for all four classes across all three models. Further, randomly chosen features from our dataset of 378 were also evaluated using the SVM, LR, and RF models, giving accuracies of 81%, 79%, and 80%, respectively. This supports the robustness of the features selected using the GA-based approach. The proposed methodology can likely be used for improved malaria diagnosis and drug target identification.
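
The degree and betweenness centrality measures mentioned in the network analysis can be illustrated on a toy graph. The sketch below is plain Python and stands in for what tools such as Gephi compute; the node names A to G are hypothetical proteins, not genes from the study, and the betweenness routine is a standard Brandes-style computation for unweighted, undirected graphs. The graph is two triangles joined through a single bridge node, so the bridge and its two neighbours should dominate the betweenness ranking.

```python
from collections import deque

# Hypothetical toy interaction network: two triangles joined via node D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degree: number of direct interaction partners.
degree = {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes-style, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                      # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: b / 2 for v, b in bc.items()}

bc = betweenness(adj)
for node in sorted(adj, key=lambda v: (-bc[v], v)):
    print(node, "degree =", degree[node], "betweenness =", bc[node])
```

In this toy network the bridge node D lies on every shortest path between the two triangles even though its degree is only 2, which is why degree and betweenness are usually read together when ranking nodes.
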

June 2023