Humans are blessed with the natural ability to see and interact with the real world. It takes our visual cortex (the part of the brain responsible for processing visual information) a fraction of a second to recognize the objects around us.
Computers are not that blessed. It is extremely hard for a computer to interact with the external environment. So what is computer vision, and why is it both important and hard?
Computer vision is an interdisciplinary field that deals with giving the computer the ability to process, interpret, and reason about visual data such as images and videos.
For a computer to understand the visual world, a number of techniques and tools must work together. A quick example is a typical face recognition system: there has to be a camera to capture the face, a vision algorithm trained to recognize that particular face, and an automated action to follow (say, opening the boss's office door when the captured face matches the boss's real face).
The above scenario sounds simple, but it is very hard in practice. For humans, recognizing people and other objects is effortless. In fact, we don't even think about it because our visual system works automatically. Computer vision seeks to mimic the human visual system, but scientists and researchers do not yet fully understand how the human visual system works.
Even though computer vision is hard, it is an exciting field. Nowadays, computer vision is everywhere. Visual sensors are all around us, billions of images and videos are uploaded to social media platforms like Instagram and YouTube every day, and so forth. All of that data needs to be processed so that computers can understand it, and there is a need to build systems that automate this processing. The daily increase of visual data on the internet and the rapid growth of vision industries are two of the main reasons why computer vision is an in-demand field today.
Application-wise, computer vision powers many things today, from cars that can drive themselves and smartphones that recognize the faces of their users to smart shops that let you check in and check out by yourself (for example, Amazon's Just Walk Out technology), and so forth. Let's review some of the most exciting applications of computer vision in different industries.
Computer vision is widely used in a whole range of real-world and industrial applications.
Let's briefly discuss some of the prominent applications of computer vision.
Today's autonomous vehicles, such as self-driving cars, use computer vision techniques and algorithms at scale. Computer vision gives self-driving cars the ability to navigate in the real world by detecting surrounding objects such as nearby cars, pedestrians, and traffic signs. Driverless cars rely heavily on state-of-the-art computer vision techniques such as object detection and image segmentation.
You can learn more about computer vision use cases in driverless cars in the following resource:
[Segmentation for Autonomous Driving: Datasets, Methods, and Challenges (Di Feng et al., 2020)](https://arxiv.org/pdf/1902.07830.pdf) and the paper site.
Computer vision is widely used in medicine to help medical professionals make better decisions. Examples of medical applications include diagnosing diseases from medical scans, extracting useful information from medical documents, and detecting tumors.
If you are interested in AI and computer vision for medicine, watch this video by Andrew Ng and Pranav Rajpurkar, or check Pranav Rajpurkar's site, podcasts, and publications.
Machine inspection is one of the prominent applications of computer vision in industries.
Manufacturing industries use computer vision to inspect products for faults or defects during production. In many modern factories, product inspection is largely automated.
Smart shops use computer vision to automate everything from checkout lanes to stocking.
A perfect example of this type of application is Amazon Go. People enter the shop by scanning a barcode (linked to their payment cards) from the Amazon Go phone app and pick up any products they want. Vision systems, combined with sensor fusion, detect those products and add them to a virtual cart, and the stock records update themselves. If you decide to put a product back, it is automatically removed from your cart. Lastly, you just walk out and your card is charged automatically. Hence the name of the technology: Just Walk Out.
OCR (optical character recognition) is another widespread use of computer vision. It is most notable in document processing, reading handwritten postal codes on letters and amounts on cheques, and automatic number plate recognition (ANPR).
There are many more applications of computer vision and they keep emerging. For example, smartphones have many vision algorithms running inside them such as face recognition, image enhancements, and other image processing related applications.
To learn more about other applications of computer vision, check out this website.
If we were asked to give the history of computer vision, most of us would not go much further back than popular deep learning algorithms for visual recognition such as convolutional neural networks. Some would perhaps take another step back to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which gave rise to the success of convolutional neural networks and of deep learning in general. The ImageNet challenge and convolutional neural networks were two major contributors to the development of research and industrial applications of computer vision, but the field is much older than that.
Computer vision began in the early 1960s at a few universities (Stanford, CMU, MIT) that were pioneering artificial intelligence. Their goal was to mimic the human perception system. In 1966, an undergraduate student at MIT was asked by his professor to "spend the summer linking a camera to a computer and getting the computer to describe what it saw" (Mind as Machine: A History of Cognitive Science, Boden 2006, p. 781). You can get a copy of the book here. And to quote the book, that was not a joke. It was a serious project, and we now know that it is still a difficult task.
Another MIT paper that seems to be linked to that summer project stated that "the final goal is OBJECT IDENTIFICATION which will actually name objects by matching them with a vocabulary of known objects." You can find the Summer Vision Project paper here and the project plan here.
At that time, the one thing that distinguished computer vision from digital image processing was its aim of extracting the three-dimensional structure from an image in order to understand the full scene.
This is not a complete history. A lot has happened since 1966, and research in visual recognition has continued ever since. For example, in 1989, Yann LeCun and colleagues applied a convolutional neural network to recognizing handwritten digits such as postal codes. That line of work led to the LeNet-5 architecture (1998), which was used to read handwritten checks.
The computer vision we know today is a whole different story. The algorithms and tools have advanced to the point where they are fairly simple to use.
For more about the history of computer vision, I recommend watching the first lecture of CS231n: Convolutional Neural Networks for Visual Recognition, or reading the resources linked on the Wikipedia page on the history of computer vision.
All the applications of computer vision we discussed in the previous section involve different tasks. Most computer vision tasks fall into five categories: recognition, generation, motion estimation, scene reconstruction, and image restoration.
Let's discuss the first two task categories in detail.
Recognition tasks are the most popular computer vision tasks. They involve identifying whether or not a given image or video contains a specific object, feature, or activity.
Here are examples of popular recognition tasks:
In image classification, we are given an image (or a set of images), and the objective is to identify the category or class that each image belongs to.
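As a minimal sketch of the last step of image classification, assume a model has already produced one raw score (logit) per class for an image. The class names, the scores, and the helper names `softmax` and `classify` are made up for illustration and do not come from any particular library.

```python
import math

# Hypothetical class names for this sketch.
CLASS_NAMES = ["cat", "dog", "car"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names=CLASS_NAMES):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical model output for one image.
label, confidence = classify([2.0, 0.5, -1.0])
print(label, round(confidence, 3))
```

In a real system the scores would come from a trained model such as a ConvNet; only the "pick the most probable class" step is shown here.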
Object detection involves recognizing an object and localizing it in the image, usually with a bounding box.
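Localization quality is commonly measured with Intersection over Union (IoU), the overlap between a predicted box and the ground-truth box divided by the area they cover together. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; the example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)      # hypothetical detector output
ground_truth = (30, 30, 70, 70)   # hypothetical annotation
print(iou(predicted, ground_truth))
```

An IoU of 1.0 means the boxes coincide, and 0.0 means they do not overlap at all.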
Image segmentation involves dividing the pixels of an image into different parts according to their features.
There are 3 types of image segmentation:
Semantic segmentation associates each pixel of an image with a class label (such as person or car).
Instance segmentation detects individual objects and associates pixels with each of them.
Panoptic segmentation is a combination of semantic and instance segmentation. With panoptic segmentation, we get more information about the image, such as the number of objects and where each of them is. To put it simply, every individual object is segmented and the remaining parts of the image are labeled with a class. To learn more about this type of segmentation, check out this paper and article.
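The difference between the three types can be sketched with small NumPy arrays, where each array entry stands for one pixel. The class IDs, the instance IDs, and the way the panoptic result is stored here (a class ID and an instance ID per pixel) are assumptions chosen for illustration; real datasets use their own encodings.

```python
import numpy as np

# Semantic mask: one class ID per pixel.
# Hypothetical IDs: 0 = background, 1 = person, 2 = car.
semantic_mask = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [2, 2, 2, 0, 0, 0],
])

# Instance mask: 0 = no object, each positive ID is one individual object.
instance_mask = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [3, 3, 3, 0, 0, 0],
])

# Semantic segmentation only tells us which pixels are "person".
person_pixels = int((semantic_mask == 1).sum())

# Instance segmentation also tells us how many separate persons there are.
person_ids = np.unique(instance_mask[semantic_mask == 1])

# Panoptic-style result: a (class ID, instance ID) pair for every pixel.
panoptic = np.stack([semantic_mask, instance_mask], axis=-1)

print(person_pixels, len(person_ids), panoptic.shape)
```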
A video is a sequence of individual images, or frames. Video classification is the task of identifying the label that is most relevant to a video given its frames.
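One simple baseline, shown here only as a sketch, is to classify each frame on its own and average the per-frame probabilities. The labels and probabilities below are made up, and practical video models usually also exploit motion between frames, which this baseline ignores.

```python
import numpy as np

CLASS_NAMES = ["cooking", "running", "swimming"]  # hypothetical labels

# Hypothetical per-frame class probabilities from an image classifier,
# with shape (num_frames, num_classes).
frame_probs = np.array([
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
])

# Average over the frame axis to get one probability vector for the video.
video_probs = frame_probs.mean(axis=0)
video_label = CLASS_NAMES[int(video_probs.argmax())]
print(video_label)
```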
Image classification, object detection, and image segmentation are the most common computer vision tasks nowadays. Algorithm-wise, Convolutional Neural Networks (a.k.a. ConvNets) are a deep learning architecture that has proven to handle recognition tasks well (we will learn more about ConvNets in later parts). Vision transformers are also trending, but ConvNets remain a current and well-tested architecture for solving a wide variety of image recognition tasks.
Generative networks are a special type of neural network architecture used to generate data samples that do not exist in reality, such as images of people or objects. You can see that kind of art on sites like thispersondoesnotexist, thisrentaldoesnotexist, and thischairdoesnotexist.
GANs, or Generative Adversarial Networks, are made of two main networks: a generator that learns to generate new plausible data, and a discriminator whose job is to learn the difference between the generated data and real data.
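The tension between the two networks can be sketched through their losses. The example below assumes the commonly used binary cross-entropy formulation and uses made-up discriminator outputs instead of real networks; `binary_cross_entropy` is a helper written for this sketch.

```python
import numpy as np


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    predictions = np.clip(predictions, eps, 1 - eps)  # avoid log(0)
    losses = targets * np.log(predictions) + (1 - targets) * np.log(1 - predictions)
    return float(-np.mean(losses))


# Hypothetical discriminator outputs: probability that each sample is real.
d_on_real = np.array([0.9, 0.8, 0.95])  # on real images
d_on_fake = np.array([0.1, 0.3, 0.2])   # on generated images

# The discriminator wants to output 1 on real data and 0 on generated data.
d_loss = (binary_cross_entropy(d_on_real, np.ones(3))
          + binary_cross_entropy(d_on_fake, np.zeros(3)))

# The generator wants the discriminator to output 1 on generated data.
g_loss = binary_cross_entropy(d_on_fake, np.ones(3))

print(round(d_loss, 3), round(g_loss, 3))
```

With these numbers the discriminator is winning: its loss is low while the generator's loss is high. During training, each network is updated to lower its own loss, which pushes the generator toward more realistic samples.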
GANs are used for various computer vision tasks, and we will practice with them in later parts. Below are some tasks of generative models in computer vision:
Image: GANs in action: Image colorization. You can try it here or here.
GANs have a lot of applications in computer vision. I highly recommend reading this Machine Learning Mastery article and this. You can also check this awesome GANs repo.
Computer vision is a hard problem, and some areas of computer vision involve real risks. For example, can you fully trust that a self-driving car will always brake by itself when confronted with a truck or a pedestrian?
In 2012, Andrej Karpathy, who later became Director of AI at Tesla and was the first instructor of Stanford's CS231n, wrote a great article about the state of computer vision and noted that "we are really, really far away". I will borrow the image and ideas from that blog post to discuss why computer vision was, and still is, a hard problem today.
Image credit: Andrej Karpathy blog
How would a computer understand the above image the way you and I understand it? Let's take some typical things that we can infer from the image.
For humans, thinking about all those things can happen in the blink of an eye. But it's a different story if we were to build a computer vision system that can interpret everything happening in the image. Take some time to read the entire blog post to get a sense of how computer vision involves a wide variety of knowledge: physics, people and emotions, the 3D structure of the scene, people's identities, and more.
There are many more challenges to building effective computer vision systems as well, and some of those are quite rooted in the above scenario.
Here are some other challenges:
Software and hardware tools: Computer vision and deep learning frameworks have advanced, but the heavy work still falls on the engineers building the systems. Also, implementing vision algorithms can require expensive equipment (like cameras, chips, and sensors) that not everyone can afford.
Computation power: Training complex deep learning algorithms requires a huge amount of computation power, and not everyone can afford to own powerful machine learning accelerators like GPUs (Graphics Processing Units).
Lack of customized datasets and labelling issues: If you were going to build a fashion classifier, you would have no problem getting training images because such images are available for free on the internet. But if you are going to solve a problem that no one has solved before, you may have to create your own dataset, which is quite expensive. You will also need to label these images (assigning names to images, say image23.png is a cat). Labelling can be crowdsourced, but it is expensive and prone to errors. Even popular datasets like ImageNet have labelling errors (see this).
Let's wrap up this intro to computer vision by looking at the vision tools landscape.
Below is a list of computer vision related tools.
The tools are arranged according to their specific tasks.
Matplotlib: Matplotlib is a fantastic Python library for visualizing images and other types of data.
TensorBoard is TensorFlow's visualization toolkit, used to visualize images and other types of data. TensorBoard is not limited to visualization: it can also be used to debug machine learning models and to visualize model graphs and weights.
Visdom is a flexible tool for visualizing live images and other types of data.
OpenAI Microscope is a collection of visualizations of every significant layer and neuron of 13 important vision models such as AlexNet, Inception v3, VGG(Visual Geometry Group) and ResNet.
NumPy is a scientific computing library used to create and manipulate multidimensional arrays. Images are arrays of pixels, so we can use NumPy to process those pixels.
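As a small illustration of treating an image as an array, the sketch below builds a tiny synthetic RGB image instead of loading a file, then crops it, flips it, and converts it to grayscale. The grayscale weights are the common ITU-R BT.601 luma coefficients; the image itself is made up.

```python
import numpy as np

# A tiny synthetic RGB image: height 4, width 6, 3 color channels, 8-bit values.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3] = [255, 0, 0]   # left half is red
image[:, 3:] = [0, 0, 255]   # right half is blue

# Cropping is just array slicing: rows first, then columns.
crop = image[:2, :2]

# A horizontal flip reverses the column (width) axis.
flipped = image[:, ::-1]

# Grayscale conversion: a weighted sum over the channel axis.
weights = np.array([0.299, 0.587, 0.114])
gray = (image @ weights).astype(np.uint8)

print(image.shape, crop.shape, gray.shape)
```

Image libraries such as OpenCV and Scikit-Image also represent images as NumPy arrays in Python, so the same indexing works on images loaded from disk.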
OpenCV (Open Source Computer Vision Library) is a powerful image processing library that can be used from different programming languages such as Python, C++, Java, MATLAB/Octave, and JavaScript. It is originally written in C++. OpenCV also provides state-of-the-art computer vision techniques such as object detection.
Scikit-Image is a flexible image processing Python library that offers different functionalities in working with images.
Pillow is a powerful and versatile Python image processing library that supports many image formats and offers a range of image processing capabilities.
OpenMMLab is another computer vision project whose libraries provide a wide range of functionality such as image classification, detection, segmentation, and super-resolution.
There are many other image processing tools; those were the popular ones. Also, most machine learning frameworks like TensorFlow and PyTorch provide image processing functions. TensorFlow has tf.image, and the PyTorch ecosystem has Kornia, a library built on top of PyTorch.
The popular deep learning frameworks that are widely used for building visual recognition algorithms are TensorFlow, PyTorch, JAX, and Apache MXNet. For a list of all deep learning frameworks, check this Wikipedia page.
Here are the most important tools or functions provided by most popular deep learning frameworks: TensorFlow and PyTorch.
TensorFlow core library contains functions and layers for building image classification models.
TensorFlow Hub contains a whole range of pretrained models for classification, object detection, image segmentation, and more. It is usually not advisable to train a big deep learning model from scratch, so make use of these pretrained models.
TensorFlow Model Garden contains various pretrained computer vision models such as Mask R-CNN for object detection and segmentation. Check this also.
TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to build, train and deploy object detection models.
TensorFlow Datasets contains a range of computer vision datasets.
PyTorch has torchvision, a computer vision library that contains popular datasets, vision model architectures, and common image transformation functions.
Facebook AI Research has a great PyTorch-based object detection and image segmentation library called Detectron2, the successor to Detectron. You can try it on this Google Colab Notebook. Check other PyTorch vision models here.
For a longer list of computer vision datasets, check this comprehensive website.
This is the end of the introduction to computer vision. If you would like to learn more, you can watch: