Nikhil Priyatam received his doctorate in Computer Science and Engineering. His research was supervised by Prof. Vasudeva Varma. Here’s a summary of Nikhil’s thesis, Medical Information Extraction from Social Media, as explained by him:
Social media has given users new opportunities to create, access, and share information independent of location and time. Recent years have witnessed an exponential rise in the number of individuals and organizations using social media to share publicly accessible healthcare information. Medical social media is the subset of social media in which users’ interests are devoted specifically to medicine and health-related issues. It encompasses healthcare-related text on generic platforms such as WordPress, Twitter, Facebook, Quora, Instagram, and YouTube, as well as dedicated medical forums such as patientslikeme.com, patient.info/forums, doctorslounge.com, and kevinmd.com/blogs. People use medical social media for a variety of reasons: seeking answers to specific questions, giving expert advice about a particular drug or treatment, spreading awareness, sharing experiences, reporting discoveries and findings, voicing opinions, and forming communities. Apart from individual users, many medical organizations such as hospitals, clinics, and insurance companies also actively contribute to medical social media.
Medical social media data plays a crucial role in several applications, such as studying the unintended effects of a drug (pharmacovigilance), recruiting participants for clinical trials, promoting a drug, and monitoring public health and healthcare delivery. However, extracting information from medical social media is challenging for several reasons. First, medical social media posts are highly noisy: they are plagued with misspellings, incorrect grammar, non-standard abbreviations, and slang. Owing to this, current medical information extraction tools fail to extract concept mentions such as “imence pain in ma leg”, which are very frequent in medical social media. Second, several applications rely on human-labeled examples for training supervised machine learning models, and manually creating such datasets is effort-intensive and expensive. Lastly, several applications require the persona associated with a medical social media post. Medical social media contributors belong to various personas, such as patient, consultant, journalist, caretaker, researcher, and pharmacist, and identifying the medical persona from the content of a post alone is a challenging task.
In this thesis, we propose solutions to three important problems in information extraction from medical social media. In our first contribution, we address medical persona classification: computationally identifying the medical persona associated with a particular medical social media post. We formulate this as a supervised multi-class text classification task and propose a neural model for it. To minimize the human labeling effort, we propose a distant supervision based approach that heuristically obtains labeled examples for training the model.
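The distant supervision idea can be illustrated with a toy heuristic labeler. The patterns below are hypothetical examples (the thesis does not list its actual rules); the sketch shows how posts whose wording self-identifies a persona can be labeled automatically, with ambiguous posts discarded to keep the weakly labeled training set clean.

```python
import re

# Hypothetical self-identification patterns; illustrative only, not the
# thesis's actual heuristics.
PERSONA_PATTERNS = {
    "patient":    [r"\bi was diagnosed\b", r"\bmy (doctor|symptoms|diagnosis)\b"],
    "consultant": [r"\bmy patient\b", r"\bin my practice\b"],
    "caretaker":  [r"\bmy (mother|father|son|daughter) (has|was diagnosed)\b"],
}

def distant_label(post: str):
    """Return a heuristic persona label for a post, or None if no rule
    fires. Posts matching more than one persona are discarded (None)."""
    text = post.lower()
    hits = {persona for persona, patterns in PERSONA_PATTERNS.items()
            if any(re.search(p, text) for p in patterns)}
    return hits.pop() if len(hits) == 1 else None
```

Posts labeled this way would then serve as (noisy) training examples for the neural classifier, replacing manual annotation.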
In our second contribution, we address medical concept normalization, which aims to map concept mentions such as “not able to sleep” to their corresponding medical concepts, such as Insomnia, in standard medical vocabularies such as SNOMED CT¹. Existing normalization methods face several challenges. First, creating training data is effort-intensive, as it requires manually mapping concept mentions to entries in a target lexicon such as SNOMED CT. Second, existing models fail to map a mention to target concepts that were not encountered during training. Third, current models must be retrained from scratch whenever new concepts are added to the target lexicon, which is computationally expensive. We propose a neural model that overcomes these limitations: it scales to millions of target concepts and trivially accommodates a growing target lexicon without incurring significant computational cost. Although our approach reduces the need for human-labeled examples, it does not eliminate it entirely. To overcome this practical challenge, we propose a distant supervision based approach to train our model. We extract informal medical phrases and medical concepts from patient discussion forums using a classifier trained on synthetic data and an off-the-shelf medical entity linker, respectively, and use pretrained sentence encoders to find the k nearest phrases for each medical concept. The resulting mappings are used to train our model, which shows significant performance improvements over previous methods while avoiding manual labeling.
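The k-nearest-phrase step can be sketched with cosine similarity over embeddings. The toy 2-dimensional vectors below stand in for real pretrained sentence-encoder embeddings (which would have hundreds of dimensions); the phrase names and vector values are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def k_nearest_phrases(concept_vec, phrase_vecs, k):
    """Return the k phrases whose embeddings are closest to the concept's."""
    ranked = sorted(phrase_vecs,
                    key=lambda p: cosine(concept_vec, phrase_vecs[p]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: dimension 0 ~ "sleep", dimension 1 ~ "pain" (illustrative).
insomnia = [1.0, 0.0]
phrases = {
    "not able to sleep":     [0.9, 0.1],
    "cant sleep at nite":    [0.8, 0.2],
    "imence pain in ma leg": [0.1, 0.9],
}
```

Here `k_nearest_phrases(insomnia, phrases, k=2)` retrieves the two sleep-related phrases; each retrieved (phrase, concept) pair becomes a distantly supervised training example for the normalization model.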
In our third contribution, we focus on the problem of automatic simplification of medical text. Patients and caregivers increasingly use the web to understand medical information, make health decisions, and validate physicians’ advice. However, most of this content is tailored to an expert audience, so people with inadequate health literacy often find it difficult to access, comprehend, and act upon. Medical text simplification aims to alleviate this problem by computationally simplifying medical text. Most text simplification methods employ neural sequence-to-sequence models, but training such models requires a corpus of aligned complex and simple sentences; creating such a dataset manually is effort-intensive, while creating it automatically is prone to alignment errors. To overcome these challenges, we propose a denoising autoencoder based neural model that leverages the simple writing style of medical social media text. Experiments on four datasets show that our method significantly outperforms the best-known medical text simplification models across multiple automated and human evaluation metrics.
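A denoising autoencoder is trained to reconstruct a sentence from a corrupted copy of it, so the training pairs come for free from unaligned text. Below is a minimal sketch of a noising function, assuming a common word-dropout-plus-local-shuffle corruption scheme; the thesis’s exact noise model is not specified here, so treat this as illustrative.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, seed=0):
    """Corrupt a token sequence: randomly drop tokens, then shift each
    surviving token by at most `shuffle_window` positions. A denoising
    autoencoder is trained to map the corrupted sequence back to `tokens`."""
    rng = random.Random(seed)
    # Word dropout: keep each token with probability 1 - drop_prob.
    kept = [t for t in tokens if rng.random() > drop_prob]
    # Local shuffle: jitter each token's sort key so it only moves locally.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]
```

With zero dropout and a zero shuffle window the function is the identity, which makes the corruption strength easy to tune during experiments.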
In our first contribution, we address medical persona classification; the focus is on understanding who authored a particular social media post. In our second contribution, we address medical concept normalization; this work is targeted at understanding what medical information a post provides. In our third contribution, we address medical text simplification; the focus is on understanding how information is conveyed in medical social media and how that style can be used to simplify medical text. In all three problems, we identify the need for manually labeled training data as a major bottleneck, and in this thesis we propose solutions that reduce the need for human labeling.
1 SNOMED CT is an acronym for Systematized Nomenclature of Medicine – Clinical Terms. The January 2019 International release of SNOMED CT contains more than 450,000 unique medical concepts.