Saideep C - Spatio-temporal Databases -

Saideep Chennupati received his MS in Computer Science and Engineering (CSE). His research work was supervised by Prof. Krishna Reddy. Here’s a summary of his research work on Improved Approaches to Mine Partial Periodic Patterns in Spatio-temporal Databases:

Data mining is a process of extracting interesting information from the large volumes of data. Data mining algorithms are being widely used in several applications like customer relationship management, inventory management, recommendation systems, fraud detection, and surveillance. Pattern mining is an important task of data mining, which involves discovering interesting associations from large transactional databases such as market basket data, e-commerce transactions, social network transactions, etc. In the literature, efforts are being made to investigate different kinds of pattern mining models such as frequent patterns, periodic patterns, utility patterns, coverage patterns, and correlated patterns. Also, research efforts are being made to propose MapReduce based approaches to improve the scalability of pattern mining approaches.

In this thesis, we investigate improved approaches to extract partial periodic patterns (3Ps) from a given spatio-temporal database. In the literature, the model of periodic patterns has been proposed to extract the interesting associations, which appear periodically in a large temporal database. Moreover, a model of 3Ps has also been proposed to extract the knowledge of interesting associations, which appear periodically in a partial manner in a large temporal database. However, the issue of extracting 3Ps from spatio-temporal databases has not addressed. In this thesis, we have proposed an improved model and algorithms to extract 3Ps from spatio-temporal databases. Moreover, we have also proposed MapReduce framework to extract 3Ps from large spatio-temporal database.

A 3P in the given temporal database is defined based on the user-given minimum periodic-support and maximum inter-arrival time constraints. For extracting 3Ps from a given temporal database, Pattern growth and ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) based approaches have been proposed in the literature. It has been observed that the existing model of extracting 3Ps (and the algorithms) face performance issues, if we employ it to extract 3Ps in a spatio-temporal database scenario. Based on the observation that the itemsets or patterns having large spatial distance between the items may not be interesting, we have identified the notion of maximum distance as the additional pruning criteria and proposed the improved model of 3Ps. Overall, the proposed model employs minimum periodic-support, maximum inter-arrival time, and maximum distance as three constraints to determine the interestingness of a pattern in a spatio-temporal database. The minimum periodic-support is equal to the minimum number of periodic occurrences of a pattern within the data. The maximum inter-arrival time is equal to the maximum duration in which a pattern must reappear to consider its occurrence as periodic within the data. The maximum distance is equal to the maximum distance between the items in a pattern. All patterns satisfying these three constraints are returned as interesting patterns. Based on the new model, we have proposed both ECLAT and pattern growth based algorithms to extract 3Ps from the given spatio-temporal databases. The efficiency of the proposed approaches is shown by conducting experiments on large synthetic and real-world spatio-temporal databases. We also demonstrate the usefulness of the extracted patterns through a case study by applying on air pollution and congestion datasets. To improve scalability, we have also proposed a parallel MapReduce based pattern growth approach to discover 3Ps in a spatio-temporal database. The issue is to maintain the load balancing among the machines and mine 3Ps in parallel by employing a large number of machines. To address the issue, we have proposed a parallel algorithm by incorporating the step of distributing transactional identifiers among the machines and mining the identical itemsets independently over the different machines. We have proposed an improved load allocation algorithm such that the machines receive the balanced load, approximately. Experimental results conducted on Apache Spark’s distributed environment show that the proposed parallel approach improves performance significantly with increase in number of machines.

Saideep C – Spatio-temporal Databases