Prof. Anoop Namboodiri provides a perspective on general research trends in Computer Vision, and those taking place at IIITH in particular, to explain how they are propelling advancements on the edge.
Computer Vision on the edge may sound like a new development, but the truth is that it has been around for decades. Some applications that we take for granted include defect detection in factories, mobile phone unlocking using fingerprint or face recognition, x-ray machines at airports, QR codes, and licence plate recognition at toll gates. Thanks to AI on the edge, we now have far more powerful applications such as autonomous navigation and home automation devices like the Amazon Echo with a built-in camera. With improved AI algorithms, newer and more powerful use cases are being enabled.
Computer Vision itself has played a pivotal role in pushing the cutting edge of AI; examples include CNNs, generative AI, and multimodal foundation models. In the future, when AGI systems become viable, we can be sure that computer vision and 3D understanding of the world will play an important role in them. Here, we focus on the recent trends that are pushing AI to edge devices.
Computer Vision, AI and Edge
Computer Vision involves processing large amounts of visual data in the form of images and/or videos. To make sense of all this data, it is imperative that a significant amount of the processing can happen on the edge; hence computer vision is an area that can benefit significantly from edge processing. There are three ways in which we can look at the interplay between Vision, AI and the edge:
- What innovations in Vision can allow us to put AI on the edge?
- What advancements in AI can help us push Computer Vision to the edge?
- What improvements in edge devices will allow us to utilise AI and Vision?
Innovations in Vision
Computer vision enables a machine to look at the world the way we do, i.e., to perceive the world. But to understand where we are today in the field, it is useful to see where we started. The earliest ancestor of the photographic camera was the pinhole camera, which was discovered accidentally long ago and can be found in many ancient monuments. It is essentially a lensless ‘camera’: a dark room with a small hole in one wall. Leonardo da Vinci was the first to compare the human eye to this ‘camera obscura’, and it was used as a model to explain human vision for centuries. From this stage, cameras evolved through the incorporation of lenses and mirrors, the recording of images on silver plates, and the development of color and chrome films; finally, the digitization of photographs made computer vision possible.
With digital cameras becoming popular at the turn of the century, it became possible to process the data from the sensors before storing it as an image. This also opened up a significant change in the imaging pipeline: the data from the sensors need not be close to the final image, as long as it can be processed to obtain one. This field is referred to as Computational Photography. A simple example is the projection of structured light patterns onto an object to capture its 3D shape, which can then be used in fine-grained recognition (say, hand geometry-based person authentication).
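The geometric core of such structured-light capture is simple triangulation. The sketch below illustrates only that principle; the focal length, baseline and disparity values are made-up placeholders, and pattern projection and correspondence matching are abstracted away.

```python
# Minimal sketch of depth recovery by triangulation, the principle behind
# structured-light 3D capture. 'disparity_px' is assumed to come from matching
# the projected pattern between projector and camera (hypothetical values here).
import numpy as np

focal_px = 1400.0        # camera focal length in pixels (assumed calibration)
baseline_m = 0.075       # projector-camera baseline in metres (assumed)

# Disparity (in pixels) for a few matched pattern points -- placeholder data.
disparity_px = np.array([35.0, 52.5, 70.0])

# Classic triangulation: depth is inversely proportional to disparity.
depth_m = focal_px * baseline_m / disparity_px
print(depth_m)  # [3.0, 2.0, 1.5] metres
```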
Stereo Vision
A more complex problem is that of panoramic stereo imaging, i.e., capturing the left- and right-eye views that we would see when turning our head around from a point. Rotating a stereo pair works only for static scenes, and using a set of cameras arranged in a circle causes significant occlusions. One of the solutions proposed by Google and Facebook was to use a set of cameras in a larger circle, facing radially outwards, and to stitch parts of these images together into the left- and right-eye views. However, the processing required was of the order of minutes on a compute cluster to generate a single stereo frame.
Innovations at IIITH
At IIITH, we worked on a solution that uses catadioptrics (a mixture of lenses and mirrors) to capture the relevant light rays from a location as close as possible to the rotating stereo camera pair. This innovation in computer vision reduces the processing significantly, and we can now move the stereo image composition to the edge and run it at 30 FPS. The optics-based solution also avoids the blind spots that are inherent to multi-camera solutions.
3D Reconstruction
With advances in the capabilities of edge processors, we can now run compact deep-learning models on the edge. This allows us to add functionalities like depth estimation, semantic segmentation, and navigation on the edge. One can also combine edge processing of tasks that are highly data-intensive with server-based processing of tasks that are compute-intensive. An example of the latter is 3D reconstruction of the world from multiple images.
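As a rough illustration of such a split, the sketch below runs a compact depth estimator on every frame at the edge and ships only occasional keyframes to a server for full reconstruction. The model, the keyframe rule and the "upload" are all illustrative placeholders, not a specific IIITH pipeline.

```python
# Edge/server split sketch: the edge handles the data-intensive per-frame work,
# while a server handles the compute-intensive multi-view 3D reconstruction.
import numpy as np

def edge_depth_estimate(frame):
    """Placeholder for a compact on-device depth network."""
    return np.full(frame.shape[:2], 2.0, dtype=np.float32)  # fake 2 m depth map

def is_keyframe(index, stride=10):
    """Trivial keyframe rule: keep every 10th frame for server-side reconstruction."""
    return index % stride == 0

server_queue = []                       # stands in for a network upload
for i in range(30):                     # pretend camera stream
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    depth = edge_depth_estimate(frame)  # used locally, e.g. for navigation
    if is_keyframe(i):
        server_queue.append((frame, depth))  # server fuses these into a 3D model

print(f"{len(server_queue)} keyframes queued for server-side reconstruction")
```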
Improving Deep Learning Models
The second factor that has contributed to the migration of vision to the edge is the improvement in efficient deep-learning models. One possible way to improve efficiency is to represent all (or most) parameters of the network with fewer bits (quantization). Work in our lab in this direction includes improvements to binary quantization that take into account the distribution of parameter values, and a ternary quantization scheme that integrates binarization and pruning into a single framework and optimises the whole network.
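To make the two ideas concrete, the toy example below applies the generic binary (XNOR-Net-style) and ternary-weight recipes to a single weight tensor; the lab's distribution-aware variants are not reproduced here.

```python
# Binary quantization keeps only the sign plus one scale; ternary quantization
# additionally maps small weights to exactly zero, which is a built-in pruning.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(64, 64)).astype(np.float32)

# Binary: W ≈ alpha * sign(W), with alpha the mean absolute weight.
alpha_b = np.abs(W).mean()
W_bin = alpha_b * np.sign(W)

# Ternary: weights below a threshold become zero, the rest share one magnitude.
delta = 0.7 * np.abs(W).mean()          # common heuristic threshold
mask = np.abs(W) > delta
alpha_t = np.abs(W[mask]).mean()
W_ter = np.where(mask, alpha_t * np.sign(W), 0.0)

print("binary error  :", np.abs(W - W_bin).mean())
print("ternary error :", np.abs(W - W_ter).mean())
print("ternary sparsity:", 1.0 - mask.mean())
```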
Using Expander Graphs
Most pruning strategies involve training the full-sized network to identify the weights that can be pruned. After removing the smaller weights, the pruned network is trained further, and this process is repeated until we reach the desired pruning level or until the error rate reaches its maximum allowed limit. A drawback of such training is that it is very compute-intensive and can be done only on dedicated high-end servers. We instead pre-prune the network and then train it, so that the training process becomes more efficient. For this, we brought in the concept of expander graphs from graph theory: essentially, combining knowledge from theoretical computer science with AI, which in turn improves the computer vision that can be deployed on the edge.
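The sketch below conveys the flavour of such pre-pruning: each output unit is wired to a small random subset of inputs before training (random sparse bipartite graphs are good expanders with high probability). The layer sizes and degree are illustrative, and this is not the exact construction from our work.

```python
# Pre-pruned (expander-style) sparse layer: the connectivity mask is fixed
# before training, so the dense network is never trained at all.
import numpy as np

n_in, n_out, degree = 512, 256, 16      # each output sees only 16 of 512 inputs
rng = np.random.default_rng(0)

mask = np.zeros((n_out, n_in), dtype=np.float32)
for row in range(n_out):
    cols = rng.choice(n_in, size=degree, replace=False)
    mask[row, cols] = 1.0

W = rng.normal(0, 0.05, size=(n_out, n_in)).astype(np.float32) * mask

def sparse_linear(x):
    # During training, gradients would also be multiplied by `mask`
    # so that pruned connections never reappear.
    return (W * mask) @ x

x = rng.normal(size=n_in).astype(np.float32)
print(sparse_linear(x).shape, "density:", mask.mean())
```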
We have also done some work on modelling the performance of deep-learning models on the edge. This allows us to predict the maximum number of parameters a model can have for a given performance target, and thus to determine the best models that can be run on a given edge hardware, train them, and compare the resulting accuracies.
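As a hedged sketch of the idea, one can fit a latency-versus-parameter-count curve from a few profiled models and invert it to estimate the largest model that still meets a latency budget. The numbers below are made up, and the published performance model is richer than a straight line.

```python
# Fit latency ≈ a*params + b from profiled models, then invert for a target.
import numpy as np

params_M   = np.array([0.5, 1.0, 2.0, 4.0, 8.0])       # model sizes (millions)
latency_ms = np.array([6.0, 9.5, 17.0, 31.0, 60.0])    # hypothetical edge timings

a, b = np.polyfit(params_M, latency_ms, 1)              # linear performance model
target_ms = 33.0                                        # e.g. a ~30 FPS budget
max_params_M = (target_ms - b) / a

print(f"latency ≈ {a:.2f}*params + {b:.2f} ms")
print(f"largest model for {target_ms} ms budget ≈ {max_params_M:.1f}M parameters")
```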
Securing Biometrics
Another interesting aspect of computer vision on edge computing is that these devices can now detect and recognize humans in their vicinity. This poses both privacy and security challenges. However, the availability of computational capability on the edge also allows us to deploy secure multiparty computation techniques to improve the security and privacy of biometric algorithms. Other capabilities that can be deployed on the edge in this context include biometric spoof detection.
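A toy illustration of the secret-sharing idea underlying such multiparty computation is shown below: an enrolled biometric template is split into two random-looking additive shares, and a match score against a freshly captured probe is assembled from partial results, so neither party ever holds the full template. Real protocols (fixed-point encoding, malicious security, share refreshing) are far more involved than this sketch.

```python
# Additive secret sharing of a biometric template for a simple match score.
import numpy as np

rng = np.random.default_rng(0)
template = rng.normal(size=128)                       # enrolled embedding (secret)
probe = template + rng.normal(scale=0.1, size=128)    # new capture of the same user

# Split the template into two additive shares: share0 + share1 == template.
share0 = rng.normal(size=128)
share1 = template - share0

# Each party computes a partial score against the probe independently.
partial0 = probe @ share0
partial1 = probe @ share1

score = partial0 + partial1                           # equals probe @ template
print(np.isclose(score, probe @ template))            # True
```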
Analysis of Edge Device Vulnerability
We also created an autoencoder network that models the physical image transformations at the edge. We then use this model to analyse the vulnerabilities of edge devices, specifically to presentation attacks (PA). Essentially, we showed that it is possible to bypass PA detection: with the help of such a model, one can perform a hill-climbing or gradient attack and fool most presentation attack detection systems with over 80% success rate.
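The sketch below shows only the mechanics of a gradient (FGSM-style) attack against a presentation-attack detector. The detector here is an untrained toy network standing in for a real PAD model, and the autoencoder that models the physical capture process is omitted.

```python
# One gradient step that pushes a spoof image towards the detector's "live" class.
import torch
import torch.nn as nn

pad_detector = nn.Sequential(            # placeholder PAD network: spoof vs. live
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
)

spoof_image = torch.rand(1, 3, 112, 112, requires_grad=True)
live_label = torch.tensor([1])           # class the attacker wants to be assigned

loss = nn.functional.cross_entropy(pad_detector(spoof_image), live_label)
loss.backward()

# Step the image against the gradient of the loss so the "live" score rises,
# while keeping pixel values in a valid range.
epsilon = 2.0 / 255.0
adversarial = (spoof_image - epsilon * spoof_image.grad.sign()).clamp(0, 1).detach()
```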
In summary
We can see that the fields of computer vision, AI and edge computing are highly interlinked and improvements in one field can affect the others in a mutually beneficial manner. This symbiotic growth is also enabling a variety of applications on the edge. The resulting model of distributed computing and learning will become extremely important in the near future.
This article was initially published in the June edition of TechForward Dispatch