Extraction And Analytics From Financial Documents

Prof. Kamal Karlapalem explores use cases of AI in the fintech world, particularly the applications being researched at the International Institute of Information Technology, Hyderabad (IIITH). This article is a brief summary of the talk he delivered at the TechForward Research Seminar Series on ‘AI Transforming Financial Services’.

In the financial world, AI leverages machine learning, natural language processing, and predictive analytics to enhance the quality of services in many ways, from automating everyday tasks and supporting regulatory compliance to identifying threats and preventing fraud.

Interplay Between Regulators and Companies
At IIITH, active research is underway to determine how AI can ease the operations of financial institutions. For instance, Indian companies are required by law to adhere to the regulations drafted by the Securities and Exchange Board of India (SEBI), the regulatory authority or ‘watchdog’ of the securities and commodities market. SEBI essentially protects investors by laying down a code of conduct that prohibits unfair and fraudulent trade practices. These documented regulations need to be interpreted by the companies’ information technology, finance and/or legal departments. The companies’ lawyers also interact with their SEBI counterparts about case arguments and outcomes. The documents themselves are interrelated, and extracting the relevant interpretation is a complex task. It is here that a well-formulated AI system can aid and improve these interpretations and interactions by processing and deriving insights from SEBI regulations, associated case files and other pertinent documents.

IIITH’s AI Framework
One of the ways the institute has enabled analytics on SEBI documents is by developing a multi-layer Applied Semantics Extraction and Analytics (ASEA) framework for document processing. Work in this area has been supported by the JP Morgan AI Faculty Research Award. The lowest layer of the framework deals with document pre-processing, which includes entity extraction and entity linking. The middle layer deals with semantic analytics, where tasks such as classification and language modelling are performed. The lowest and middle layers can be supported by GenAI solutions. The top layer is the applied semantics layer, in which the extracted semantics and the analytics performed on them are used for various user-relevant tasks; deploying solutions at this layer requires domain inputs and understanding from the user. The entire pipeline thus supports a range of use cases such as extractive question answering, provenance of documents, regulation violation prediction, legal case file segmentation (through sentence classification), regulation simplification, and case file summarization.

Semantic Segmentation of Case Files
For this task, a unique dataset of annotated adjudication orders pertaining to regulations was first created. Adjudication orders are essentially rulings passed by SEBI against alleged violators of the SEBI Act, rules or regulations. The IIITH team then experimented with a number of machine learning models to train a sentence classifier that helps in the semantic segmentation of case files, which in turn helps in document retrieval.
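As a rough illustration of how sentence classification can yield a segmentation, the sketch below merges consecutive sentences that share a predicted label into one segment. It is not the IIITH team's implementation: the label names, the example sentences and the `segment` helper are all hypothetical, and the labels are written by hand here where a trained classifier would normally supply them.

```python
from itertools import groupby

# Hypothetical (sentence, label) pairs. In practice the labels would come
# from a trained sentence classifier; the label set here is an assumption.
labelled_sentences = [
    ("The noticee is a listed company.", "facts"),
    ("It did not make the required disclosure.", "facts"),
    ("This amounts to a violation of the regulations.", "violation"),
    ("A monetary penalty is imposed on the noticee.", "penalty"),
    ("The penalty shall be paid within the stated period.", "penalty"),
]


def segment(labelled):
    """Merge runs of consecutive sentences that share a label into segments."""
    segments = []
    for label, run in groupby(labelled, key=lambda pair: pair[1]):
        segments.append((label, [sentence for sentence, _ in run]))
    return segments


segments = segment(labelled_sentences)
for label, sentences in segments:
    print(f"[{label}] {' '.join(sentences)}")
```

Each resulting segment can then be indexed separately, which is one way a retrieval system might return only the relevant part of a case file.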

Regulation Violation Detection
Typically, a single case file has a multitude of sections. In order to zero in on the section that deals with the violation of regulations, the team has developed a semantic segmentation engine that separates out the different sections of the case file. Transformer-based machine learning models have become extremely popular for NLP tasks, and the IIITH researchers too built a transformer-based multi-label classifier. Out of the box, such models tend to perform poorly on domain-specific tasks such as those found in the legal, medical or scientific domains, but fine-tuning them greatly improves their performance. In this case too, the ML model was fine-tuned specifically for the SEBI domain on a dataset comprising SEBI regulations, case files, and SEBI-related news articles.
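What makes a classifier multi-label is that each label is decided independently, so one passage can receive several labels or none. The sketch below shows only that final decision step, under stated assumptions: the label names are hypothetical (the article does not list the real ones), the scores are made-up numbers standing in for the output of a fine-tuned model, and the 0.5 threshold is an arbitrary choice.

```python
import math

# Hypothetical label names; the article does not enumerate the real ones.
LABELS = ["insider_trading", "disclosure_failure", "price_manipulation"]


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose independent probability meets the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up raw scores, one per label, standing in for a model's output.
example_logits = [2.1, 0.3, -1.7]
predicted = predict_labels(example_logits)
print(predicted)
```

A single-label classifier would instead apply a softmax across the scores and pick exactly one label, which would not suit a case file that involves more than one kind of violation.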

Regulation Biography
SEBI’s regulatory documents are a mine of information, which includes their amended versions and additional supporting documents related to the domain of banking. With the help of AI, the IIITH team has demonstrated not only how this information can be analysed, extracted and tagged using NLP methods, but also how the changes made from one version of a document to another can be identified visually. These methods can also surface additional information extracted from annual reports and concept papers, which helps in understanding the rationale behind the amendments themselves. The system also provides tags to categorise the types of amendments and identifies references to the regulations in news articles.
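The version comparison described above can be approximated, in its simplest form, with a line-level text diff. The sketch below uses Python's standard `difflib` module; it is not the team's tool, and the clause text is invented purely for illustration and is not taken from any SEBI regulation.

```python
import difflib

# Invented clause text for illustration only.
old_version = [
    "The company shall file the report within 30 days.",
    "The report shall be signed by the compliance officer.",
]
new_version = [
    "The company shall file the report within 15 days.",
    "The report shall be signed by the compliance officer.",
    "A copy shall be published on the company website.",
]

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        old_version, new_version, fromfile="original", tofile="amended", lineterm=""
    )
)
print("\n".join(diff))

# Collect the changed lines, skipping the "---" and "+++" file header lines.
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]
removed = [line[1:] for line in diff if line.startswith("-") and not line.startswith("---")]
```

The `added` and `removed` lists could then feed a later step, such as tagging the type of amendment, although how the actual system does this is not described in the talk summary.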

Challenges and Future Direction
The Applied Semantics Extraction and Analytics framework can be applied to many other domains such as health, scientific literature and manuals. The framework can judiciously use GenAI for some of the lower-level ML-NLP tasks. The challenge is to deploy GenAI at the right places to arrive at a production-quality solution. Extracting and managing high-level semantics for documents is a challenge IIITH is keen to take up. Applying ASEA to documents like SEBI regulations and SEBI case files requires tailor-made solutions, and the hope is to see further research in this direction.

This article was initially published in the September edition of TechForward Dispatch 

Prof. Kamal Karlapalem is a Professor and Applied Computer Scientist at the Data Science and Analytics Centre and Agents and Applied Robotics Group, IIITH. His areas of research interest include Multi-Agent Systems Simulations, Multi-Robotic Systems, Workflow Management Systems, Visual Data Analytics, Data Mining, Text Analytics, and Database Systems. He also guides institutions for academics and curriculum development.
