Aseem Anand received his MS Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Prof. Kamal Karlapalem. Here’s a summary of Aseem Anand’s thesis Distributed Skyline Query Algorithms.
The problem of multi-objective decision making has been an important one in the field of database management systems. Establishing dominance relationships among different data points has useful applications in industries such as tourism & sports. Given a dataset D with c points, a point pi dominates another point pj if it is better than or equal to pj in all the dimensions in space S and better than pj in at least one dimension. The set of such points which dominate all other data points in the dataset is referred to as the skyline set of points.
The skyline operator suffers from the curse of dimensionality which means as the number of dimensions increase the number of skyline points become large and difficult to compute. With data growing at exponential rate in today’s information age traditional algorithms designed for skyline query computation over limited resource standalone computers will not fare too well. Scaling up systems in terms of computing power & memory has traditionally proven to be expensive. Therefore, there is a need to leverage a cluster of cheap commodity hardware to efficiently compute skylines of large datasets. Thus the data needs to be divided into multiple smaller datasets which can be used to compute individual skylines across the smaller datasets and merged to form the overall skyline of the entire dataset by using a divide and conquer approach.
The curse of dimensionality is even more amplified on large datasets hence making the problem of computing skyline even more complex, thus motivating us to use techniques which do not require us to eliminate points already in the computed skyline by points encountered afterwards. We achieve this by approaching points along the Z-order curve which provides us a monotonic access order of data points thus identifying skyline points earlier. The Z-address of a point in multi-dimensions is simply calculated by interleaving the binary representations of its coordinate values.
In this thesis, we address the problem of efficient computation of skyline for large datasets using the Map-Reduce programming paradigm. Our approach is to map multidimensional data to one dimension while preserving the locality of the data points using Z-order curve and exploiting the sort phase of a typical Map-Reduce job to receive these points in the Reduce job automatically sorted in the requisite order.
Our results show a reduction of up to 31% in execution time for skyline computation over a Map-Reduce framework using our algorithm as compared to existing methods.