N V Ravindra received his MS Dual Degree in Computational Linguistics (CL). His research work was supervised by Dr. Manish Srivastava. Here’s a summary of Ravindra’s thesis, Annotation of Chemical Entities and their Roles in Patents:
This thesis contributes to chemical named entity detection and chemical role labeling. We present guidelines for chemical named entity detection in the Examples section of patents. In the process, we created the WEAVE corpus, the second-largest chemical NER dataset. We also developed computational methods for named entity recognition, and we describe the models along with their features and limitations.
We also extend the WEAVE corpus with chemical role labels for the Examples section of patents. A chemical's role label tends to change over time as the reaction proceeds step by step: a chemical labeled as the product in the first step would be labeled as a reactant in the next step. This fluid labeling, in which the same chemical receives several labels within one coherent reaction discourse, is yet to be computationally modeled. The dataset developed in this thesis is a first step in that direction.
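As a rough illustration of step-dependent role labels, the sketch below records each mention of a chemical together with the reaction step and its role, then lists the chemicals whose role changes across steps. The class name, the role names ("reactant", "product"), and the two-step example reaction are assumptions made for illustration only; they are not the annotation scheme or data of the WEAVE corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleMention:
    """One annotated mention: which step, which chemical, which role (hypothetical schema)."""
    step: int
    chemical: str
    role: str


# Invented two-step example: the product of step 1 is consumed in step 2.
mentions = [
    RoleMention(1, "benzoic acid", "reactant"),
    RoleMention(1, "benzoyl chloride", "product"),
    RoleMention(2, "benzoyl chloride", "reactant"),
    RoleMention(2, "benzamide", "product"),
]


def role_history(mentions):
    """Map each chemical to its (step, role) pairs in step order."""
    history = {}
    for m in sorted(mentions, key=lambda m: m.step):
        history.setdefault(m.chemical, []).append((m.step, m.role))
    return history


def role_changes(mentions):
    """Keep only chemicals that carry more than one distinct role."""
    return {
        chemical: pairs
        for chemical, pairs in role_history(mentions).items()
        if len({role for _, role in pairs}) > 1
    }


changed = role_changes(mentions)
```

Here `changed` contains only "benzoyl chloride", which is a product in step 1 and a reactant in step 2. A single label per chemical cannot capture this, which is why the role has to be attached to each mention.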
We evaluated baseline performance on the WEAVE corpus using a deep affix-based BiLSTM-CRF neural architecture. The model achieves an F-score of 91.37%, a near state-of-the-art result. We also compare this baseline with performance on the BioCreative-V CHEMDNER-patents corpus. The two corpora were found to be complementary in nature, and hence could be used for transfer learning.
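For readers unfamiliar with how an F-score is computed for named entity recognition, here is a minimal sketch of exact-match, entity-level scoring over BIO tags. The summary does not state the exact evaluation protocol used in the thesis, so the single entity type, the tag names ("B", "I", "O"), and the function names are assumptions; the toy tag sequences are invented and unrelated to the reported 91.37%.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans in a BIO tag sequence (end exclusive)."""
    spans = set()
    start = None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
        # An "I" with no open entity is ignored in this sketch.
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 over entity spans."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy sequences: three gold entities, two predicted, one exact match.
gold = ["B", "I", "O", "B", "O", "B", "I"]
pred = ["B", "I", "O", "O", "O", "B", "O"]
precision, recall, f1 = entity_f1(gold, pred)
```

In this toy case the prediction finds one of the three gold entities exactly and gets the boundary of another wrong, so precision is 1/2, recall is 1/3, and F1 is 0.4. Under exact matching, a partially correct boundary counts as both a false positive and a false negative.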
As future work, we suggest how this corpus could be used to build systems that automatically select chemical reactions from the literature and infer reaction rules for a noninteractive organic synthesis design system.