Shashwat Khandelwal received his MS Dual Degree in Electronics and Communication Engineering (ECE). His research was supervised by Dr. Suresh Purini. Here’s a summary of Shashwat Khandelwal’s thesis, Accelerating Local Laplacian Filters on FPGAs:
Image processing has many real-world applications, including agriculture, multimedia security, remote sensing, computer vision, medical applications, and biometric verification. Applications such as biometric verification and object detection are often run on hardware accelerators to meet their real-time requirements. While some applications are processed in data centers, others must be processed on the edge because of security, overall latency, and other issues associated with data centers. With their high speed, very low power consumption compared to other technologies, and reprogrammability, FPGAs provide a sustainable high-speed solution that can be deployed both in data centers and on the edge, which makes them well suited to accelerating image processing algorithms. They are also very capable of exploiting the spatial (data-level) and temporal (task-level) parallelism [9] [2] inherent in image processing tasks. This thesis focuses on the FPGA implementation of Local Laplacian filters, which are used for tone and detail manipulation in images. The algorithm uses simple Gaussian and Laplacian pyramids that can be processed in parallel to speed up its processing, making it a strong candidate for acceleration on an FPGA. We propose using the LUTs present on the board to reduce the complex computations involved in the algorithm, which saves hardware resources and improves latency. We also propose a novel 3-stage pipelined convolution engine to exploit the massive parallelism that the algorithm offers. Together, these proposals yield a good speedup over the original baseline multicore-CPU implementation. We also compare our implementation’s accuracy with that of the original implementation and find the SNR to be above 40 dB in all cases.
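As a rough illustration of the accuracy comparison, the sketch below computes a signal-to-noise ratio in decibels between a reference output and a test output. The summary does not state which SNR definition the thesis uses, so the formula here (reference power divided by error power) is an assumption, and the function name `snr_db` and the sample pixel values are hypothetical.

```python
import math


def snr_db(reference, test):
    """Return the SNR in dB of `test` measured against `reference`.

    Assumed definition: 10 * log10(sum(ref^2) / sum((ref - test)^2)).
    The thesis may use a different variant (for example PSNR).
    """
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise_power == 0:
        return math.inf  # identical outputs: no measurable error
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical flattened pixel values from a CPU baseline and an FPGA run.
cpu_output = [100.0, 100.0, 100.0, 100.0]
fpga_output = [101.0, 101.0, 101.0, 101.0]
result = snr_db(cpu_output, fpga_output)
```

Under this definition, a uniform error of 1 on pixel values of 100 corresponds to an SNR of about 40 dB, which gives a sense of the scale of the threshold quoted above.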
To analyze our implementation’s resource usage efficiency, we compute the percentage of active and inactive cycles. We find that, because of its low resource utilization, the design can run on lower-end boards such as the ZedBoard.
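The Gaussian and Laplacian pyramids mentioned in the summary can be sketched in a few lines. This is a minimal one-dimensional illustration of the general pyramid technique, not the thesis's hardware design or the full Local Laplacian filter: the 3-tap blur kernel, the nearest-neighbour upsampling, and all function names are assumptions chosen for brevity.

```python
def blur(signal):
    """Smooth with a [1, 2, 1] / 4 kernel, clamping at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out


def downsample(signal):
    """Blur, then keep every second sample."""
    return blur(signal)[::2]


def upsample(signal, length):
    """Nearest-neighbour upsampling to the requested length."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]


def gaussian_pyramid(signal, levels):
    pyramid = [list(signal)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def laplacian_pyramid(signal, levels):
    """Each level stores the detail lost between two Gaussian levels."""
    gauss = gaussian_pyramid(signal, levels)
    lap = []
    for fine, coarse in zip(gauss, gauss[1:]):
        up = upsample(coarse, len(fine))
        lap.append([f - u for f, u in zip(fine, up)])
    lap.append(gauss[-1])  # the coarsest level is kept as-is
    return lap


def reconstruct(lap):
    """Collapse the pyramid by upsampling and adding detail back."""
    current = lap[-1]
    for detail in reversed(lap[:-1]):
        up = upsample(current, len(detail))
        current = [d + u for d, u in zip(detail, up)]
    return current


signal = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
lap = laplacian_pyramid(signal, 3)
restored = reconstruct(lap)
```

Once the Gaussian levels exist, each Laplacian level depends only on two adjacent Gaussian levels, so the detail levels can be computed independently of one another. That independence is the kind of parallelism a pipelined hardware design can exploit.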