V Dharma Teja, supervised by Prof. Kishore Kothapalli, received his doctorate in Computer Science and Engineering (CSE). Here is a summary of his research work on Efficient Parallel Algorithms for Sparse Matrix Operations on a GPU with Applications:
Computational approaches use computing machines to perform functional tasks such as image classification, speech recognition, and fluid flow simulation. A computational approach is composed of many computational tasks, so the time taken to perform a functional task depends on the time taken to process those underlying computational tasks. The time taken to perform a computational task in turn depends on three factors: the computing machine, the algorithm and data structures, and the input. Improved runtime for a computational task can therefore be obtained by co-designing the algorithm and data structures while being aware of the target computing machine and the input.

A computational approach, when applied to a functional task, achieves a certain accuracy and takes a certain amount of time. While the accuracy is independent of the computing machine, the time depends on the choice of the computing machine. Hence, better computational approaches can be designed by being aware of the computing machine. The proposed research work is accordingly categorized into two themes: 1) design efficient algorithms and data structures for a computational task on a given instance of (computing machine, input class); 2) design efficient computational approaches for a functional task on a given computing machine.

Within these themes, choices must be made for the computing machine, the computational tasks, and the computational approach. We chose the GPU as the computing machine for its high throughput, high memory bandwidth, and applicability in accelerating a wide variety of computational tasks across application domains. We chose sparse matrix operations as the computational tasks, since they play a crucial role in computational approaches used to solve problems in scientific computing, artificial intelligence (AI), and graph analytics. We chose Artificial Neural Networks (ANNs) as the computational approach, since they achieve human-level accuracy on many functional tasks in the AI domain. The GPU computing machine and the ANN computational approach together kickstarted the AI revolution, and we made these choices because of the opportunity they offer and the potential impact that can be created using our proposed co-design approach.
In the first part (Chapters 2-3) of the thesis, we focus on the co-design approach for designing better algorithms and data structures for computational tasks on the GPU. We work on two computational tasks: QBMM, a matrix multiplication task (C = A × B) where A and B are quasi-band sparse matrices, and HSOLVE, a matrix solver task (Ax = b) where A is a Hines sparse matrix. Using the co-design approach, we achieve an average speedup of 2x for QBMM and 2.5x for HSOLVE.

In the second part (Chapters 4-8) of the thesis, we focus on GPU-aware computational approaches based on Artificial Neural Networks (ANNs) for efficiently performing functional tasks in the area of Artificial Intelligence (AI). The efficiency of an ANN can be measured along four metrics: accuracy, memory, compute, and runtime. Sparse ANNs, a subclass of ANNs, are shown to have better memory and compute metrics with minimal or no loss in accuracy compared to dense ANNs. However, despite requiring less compute, sparse ANNs take more time to process on the GPU than dense ANNs, because their irregular compute and memory access patterns are unsuited to highly regular hardware like the GPU. This problem can be alleviated by making the sparsity structured, but introducing structure negatively affects the accuracy metric. The challenge, then, is to design structured sparsity patterns that lead to better runtime on the GPU while maintaining accuracy.

The block sparsity pattern leads to better runtime on the GPU than the unstructured sparsity pattern, owing to its regular compute and memory access patterns. So in Chapter 5, we propose a novel method to train block sparse ANNs from scratch. Sparse ANNs can be generated either by training from scratch or by fine-tuning. While training from scratch provides maximum flexibility in generating sparse ANNs, fine-tuning is widely used because it is faster and starts from an already available pre-trained dense ANN. However, the main disadvantage of fine-tuning from a runtime perspective on the GPU is its sensitivity to the imposed sparsity pattern: the more rigid the pattern, the lower the accuracy. While the block sparsity pattern leads to better runtime on the GPU, its rigidity (a block must be either entirely zero or entirely retained) leads to less accurate sparse ANNs under fine-tuning. So, to relax the rigidity of block sparsity and make it more flexible for maintaining accuracy during fine-tuning, we introduce the idea of multiple blocking levels in the sparsity pattern. The main advantage of a sparsity pattern with multiple blocking levels is its ability to retain the parameters that are essential for the accuracy of the fine-tuned sparse model, while preserving the structure required for runtime performance.
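To make the appeal of block sparsity concrete, here is a minimal sketch in NumPy/SciPy of magnitude-based block pruning followed by storage in the Block Sparse Row (BSR) format, where every surviving block is a dense tile that admits regular compute and memory access. The block size, pruning criterion, and function names here are illustrative assumptions; the thesis's Chapter 5 method trains block sparse ANNs from scratch rather than pruning a dense matrix as this sketch does.

```python
import numpy as np
from scipy.sparse import bsr_matrix

def block_prune(W, block=(4, 4), density=0.25):
    """Zero out all but the highest-magnitude blocks of W.

    Illustrative magnitude-based block pruning; the thesis trains
    block-sparse ANNs from scratch, which this sketch does not model.
    """
    r, c = block
    n_r, n_c = W.shape[0] // r, W.shape[1] // c
    # View W as a grid of (r x c) blocks and score each block by its L1 norm.
    blocks = W.reshape(n_r, r, n_c, c)
    scores = np.abs(blocks).sum(axis=(1, 3))          # shape (n_r, n_c)
    k = max(1, int(density * scores.size))            # number of blocks to keep
    threshold = np.partition(scores.ravel(), -k)[-k]
    mask = (scores >= threshold)[:, None, :, None]    # broadcast over block grid
    return (blocks * mask).reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W_sparse = block_prune(W, block=(4, 4), density=0.25)

# BSR keeps each surviving block as a dense tile: regular compute and
# memory access per block, which is what maps well onto GPU warps.
W_bsr = bsr_matrix(W_sparse, blocksize=(4, 4))
x = rng.standard_normal((64, 8)).astype(np.float32)
y = W_bsr @ x                                        # block-sparse matmul
print(W_bsr.nnz / W.size, y.shape)                   # ~0.25 density, (64, 8)
```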
We propose two sparsity patterns with multiple blocking levels: HB (Hierarchical Block) in Chapter 6 and RMB (Regularized Multi Block) in Chapter 7. With HB, we show that the accuracy gap of the plain block sparsity pattern can be significantly bridged while retaining the potential for acceleration on the GPU. Building on the HB sparsity pattern, and with a deep understanding of GPU architecture and fine-tuning dynamics, we design the more generic, uniform, and accuracy-friendly RMB sparsity pattern by introducing regularizing constraints on multiple blocking levels while retaining the flexibility required for maintaining accuracy. With the RMB sparsity pattern, we reach an important milestone: the same accuracy as the unstructured sparsity pattern (the most suitable for accuracy and the least suitable for runtime on the GPU) while being 4-5x faster on the GPU, for the industry-standard image classification task on the ImageNet dataset using the ResNet-50 model. We further apply the RMB sparsity pattern to an image segmentation task and observe the same accuracy as the unstructured sparsity pattern while again being 4-5x faster on the GPU. It is also worth noting that the latest NVIDIA A100 and H100 GPUs provide hardware acceleration for the 2:4 sparsity pattern, which is a simple configuration of our generic RMB sparsity pattern.

Edge devices are widely deployed to perform AI tasks for latency, reliability, and privacy reasons. These devices operate on a low power budget and thus have limited memory capacity and compute capability. Because of the importance of edge devices and their unique characteristics, there is a need to develop ANNs that are efficient on them. In Chapter 8, we address this need with our proposed RBGP (Ramanujan Bipartite Graph Product) framework. RBGP uses Ramanujan graphs to improve the accuracy metric by ensuring good connectivity between the neurons of the ANN, and uses graph products to improve the memory and runtime metrics by generating structured sparse ANNs with cloning and uniformity properties. On the NVIDIA Jetson Nano GPU, we achieve the same accuracy as unstructured sparsity while reducing the memory requirement by half and being 7-10x faster on the GPU. Although RBGP sparse ANNs are designed for edge devices, they achieve similar gains on the NVIDIA V100 server GPU.
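As a concrete illustration of the structured patterns discussed above, the sketch below implements the 2:4 pattern (keep the two largest-magnitude weights in every group of four) and composes two small bipartite connectivity masks with a Kronecker product, in the spirit of RBGP's graph products. This is a hedged sketch under stated assumptions: the helper names are ours, the 2:4 rule shown is plain magnitude selection rather than the thesis's RMB procedure, and the Ramanujan graph constructions used by RBGP are not reproduced here.

```python
import numpy as np

def prune_2_4(W):
    """Keep the 2 largest-magnitude weights in every group of 4 (2:4 pattern).

    This pattern is hardware-accelerated on NVIDIA A100/H100 GPUs; the
    thesis treats it as one simple configuration of the RMB pattern.
    """
    groups = W.reshape(-1, 4)
    # Rank entries in each group of 4 by magnitude; zero the 2 smallest.
    order = np.argsort(np.abs(groups), axis=1)
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, :2], False, axis=1)
    return (groups * mask).reshape(W.shape)

def graph_product_mask(m_outer, m_inner):
    """Compose two small bipartite connectivity masks into one large mask.

    Illustrative only: RBGP uses products of Ramanujan bipartite graphs,
    whose construction is not shown here. The Kronecker product clones
    m_inner into every edge of m_outer, giving a uniform, structured mask.
    """
    return np.kron(m_outer, m_inner).astype(bool)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
W24 = prune_2_4(W)
assert (W24.reshape(-1, 4) != 0).sum(axis=1).max() <= 2  # at most 2 of every 4 survive

outer = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])  # a 2-regular bipartite pattern
inner = np.eye(4, dtype=int)
mask = graph_product_mask(outer, inner)              # 12 x 12 structured mask
print(mask.mean())                                   # density = (2/3) * (1/4)
```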
In summary, our work demonstrates that significant runtime benefits can be obtained for computational tasks and functional tasks by using the co-design approach and hardware-aware computational approaches, respectively. Our choices of the GPU as the computing machine, sparse matrix operations as the computational tasks, and ANNs as the computational approach were made only to demonstrate the potential benefit. Nevertheless, the same ideas can be applied to other computing machines, to the computational tasks that arise in other settings, and to the computational approaches used to solve a wide variety of functional tasks across application domains.

June 2024