Lalit Mohan S received his doctorate in Computer Science and Engineering (CSE). His research work was supervised by Dr. Raghu Reddy. Here’s a summary of Lalit Mohan’s thesis, A Fine Grained Approach to Develop Domain Specific Search Engine:
The growth of content and users on the internet provides research opportunities in crowdsourcing, search engines and other areas that consume and process the web. Lack of clarity in task definition, uncertainty in completion time, concerns about data confidentiality, unavailability of workers, cost considerations and quality of output are some of the open problems in crowdsourcing. Quality of output is one of the primary concerns in crowdsourcing tasks, as the lack of it defeats the objective of using the crowd. Researchers are exploring various mechanisms based on statistics, machine learning and game theory to achieve acceptable output quality.
As part of this work, an online survey on crowdsourcing reiterated the importance of quality assurance and quality control of tasks. The survey, along with a review of the research literature, confirmed that quality mechanisms for crowdsourcing tasks benefit from the availability of a credible knowledge base in a domain. To build a domain-specific knowledge base, this work proposes a fine-grained approach to identify sub-domains within a domain and extract related content, enrichment of a knowledge base in the form of an ontology, and credibility assessment based on web genres. The research outcomes on creation and enrichment of the knowledge base are used to develop a domain-specific search engine.
Web pages on the internet contain content across domains. Crawler efficiency, sub-domain representation and noise reduction are required to extract domain-specific content. A systematic approach to identify sub-domains within a domain is proposed. Further, the work extends the metaheuristic Artificial Bee Colony (ABC) algorithm to extract sub-domain URLs. The extended ABC crawler performed better than existing industry-scale open-source crawlers in terms of the volume of extracted URLs and the usage of compute resources. A metric, SeedRel, is proposed to measure the precision of seed URLs based on the presence of child URLs and content relevance. Sub-domain coverage is measured against a baseline Shannon Diversity Index value. In the experiments, 34,007 seed URLs and 400,726 child URLs of information security sub-domains were extracted. The measured diversity index of 2.10 confirms the representation of sub-domains within the domain. The results on URL extraction, seed URL relevance and sub-domain diversity improved on existing approaches.
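For illustration, the Shannon Diversity Index over per-sub-domain URL counts can be computed as below; this is a minimal sketch with hypothetical counts, not the thesis data or crawler output.

```python
# Shannon Diversity Index H = -sum(p_i * ln(p_i)), where p_i is the
# proportion of crawled URLs belonging to sub-domain i.
from math import log

def shannon_diversity(counts):
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

# Hypothetical URL counts per information security sub-domain
# (e.g. malware, network security, cryptography, ...):
counts = [5200, 4800, 3900, 4100, 3600, 4400, 4000, 4007]
print(f"Shannon Diversity Index: {shannon_diversity(counts):.2f}")
```

A higher index indicates that the extracted URLs are spread evenly across sub-domains rather than concentrated in a few, which is what the baseline comparison checks.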
The extracted domain-specific content is the input to build the knowledge base. An ontology is a representation of knowledge and provides extensibility, interoperability and reasoning capabilities. To build the knowledge base, the work proposes using an existing, well-accepted seed ontology for enrichment, so that a baseline is available, changes are incremental, and both automated and manual validation can be exercised. It is necessary to check whether an ontology needs to be enriched for a given input, to avoid unnecessary computation. Similarly, it is necessary to check whether the given ontology represents the available domain content. To validate the need for enrichment, a lightweight approach to evaluate ontology sufficiency based on software requirements engineering quality principles is proposed. The research literature shows that existing ontology enrichment algorithms based on natural language processing and machine learning techniques perform poorly at contextualized extraction of concepts. This necessitated a sequential deep learning architecture that traverses dependency paths in text and extracts concepts embedded in phrases and sentences based on learned path representations. The proposed fine-grained ontology enrichment approach, OntoEnricher, exploits both syntactic sentence structure and distributional semantics to identify and extract concepts and instances from unstructured text, leveraging bidirectional LSTMs and the pre-trained Universal Sentence Encoder transformer model. Ontology enrichment was evaluated on 97,425 keyword phrases and a 2.8 GB information security corpus, achieving an accuracy of 80%.
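The sketch below shows the general shape of a bidirectional LSTM over dependency-path sequences, in the spirit of OntoEnricher; the class name, dimensions, relation labels and toy inputs are assumptions for illustration, and the actual model additionally incorporates Universal Sentence Encoder embeddings of the candidate term pairs.

```python
# A minimal BiLSTM path classifier sketch (PyTorch); not the thesis
# implementation. Each dependency path between a candidate term pair is
# encoded as a sequence of edge indices, and the final hidden states
# decide which relation (if any) the pair expresses.
import torch
import torch.nn as nn

class PathClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_relations=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, paths):
        # paths: (batch, seq_len) indices of dependency-path edges
        emb = self.embed(paths)
        _, (h, _) = self.lstm(emb)
        # concatenate final forward and backward hidden states
        feat = torch.cat([h[0], h[1]], dim=-1)
        return self.out(feat)  # logits, e.g. concept / instance / none

model = PathClassifier(vocab_size=1000)
logits = model(torch.randint(0, 1000, (4, 12)))  # 4 paths, 12 edges each
print(logits.shape)  # torch.Size([4, 3])
```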
The credibility of extracted content enhances the usefulness of knowledge bases. Search engine results also suffer when credibility factors do not dominate ranking. After a study of existing research articles on ranking and credibility, the work proposes genre-based credibility assessment of web page content. An online survey (including crowdsourcing) was conducted to validate the existence of, and need for, web genre in credibility assessment. An exhaustive list of surface, content and off-page web page features was identified to validate the applicability of genre for credibility assessment. These features were used to prepare a dataset and automate genre classification with a machine learning based gradient boosted decision tree algorithm, achieving an accuracy of 88.75%. An open-source framework for credibility assessment of URLs was developed to automate genre classification and credibility assessment; the work also proposes a FACT score to calculate the credibility of a web page. The framework assessed 10,429 information security URLs, and its scores showed a 69% correlation with Web of Trust scores.
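As a rough illustration of the classification step, a gradient boosted decision tree can be trained on per-URL feature vectors as below; the feature names, synthetic data and genre labels are placeholders, not the thesis dataset or its 88.75% result.

```python
# Genre classification sketch with scikit-learn's gradient boosted
# decision trees; assumes one feature vector per URL.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical surface/content/off-page features per URL:
# [url_length, num_outlinks, keyword_density, domain_age_days]
X = rng.random((500, 4))
y = rng.integers(0, 5, 500)  # genre labels, e.g. blog/news/forum/portal/help

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

The predicted genre then feeds into the credibility assessment alongside the other web page features.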
To demonstrate the usefulness of the knowledge base built through sub-domain URL extraction, ontology enrichment and credibility assessment, an Information Security Search Engine, SIREN (Security Information Retrieval and Extraction eNgine), is developed as a proof-of-concept. SIREN is deployed on an OpenStack-based distributed architecture for ease of maintenance and scalability. The source code of the application is available on GitHub for further enhancements and for developing other domain-specific search engines. The contributions were well received when demonstrated at the Annual Information Security Summit, a National Centre of Excellence for Cyber Security event and the Banks’ CISO Forum. SIREN will be integrated into the IB-CART (Indian Banks’ Centre for Analysis of Risks and Threats) platform so that organizations can use it for crowdsourcing threat intelligence.