
Demystifying Machine Learning

Brikesh Kumar

Let's explore machine learning, where the only thing more baffling than the jargon is why we haven't let the machines take over yet.


Quick Introduction


If you call Earth home, you've probably heard the buzz about "machine learning." And it's no exaggeration: it really is all around us!


So, what exactly is machine learning? Machine learning is a subfield of artificial intelligence (AI) focused on building systems that learn from data, identify patterns, and make decisions with minimal human intervention. Essentially, machine learning involves training a computer model using large amounts of data and algorithms that allow it to learn how to perform a specific task.


There are plenty of online resources that explain the various types of machine learning and related nuances, including its pros and cons, limitations, and capabilities. That, however, isn't the intent of this post. Here, we're going to focus exclusively on the 'learning' aspect of the technology, so we'll try not to get entangled in the jargon that surrounds it.


There are various methods for learning from data across different types of machine learning solutions. In this article, however, our focus will be specifically on computer vision.


The Learning Aspect of Machine Learning


When my daughter was about a year old, she often watched YouTube videos featuring vibrant images of fruits and vegetables, accompanied by their names spoken aloud. This content looped repeatedly, and she seemed to enjoy it. At that time, I'm not entirely sure if she grasped the concepts of shape, size, and other distinguishing characteristics that set an apple apart from, say, a banana. Nevertheless, after seeing these repeated presentations, she began to develop an understanding of what an apple was and how it differed from other fruits.


Learning Aspect of Machine Learning

As adults, we can easily enumerate the attributes that differentiate one fruit from another. However, even young children, who lack a complete grasp of these attributes, can still distinguish between various objects. The specific criteria or mental models they use to make these distinctions remain a fascinating question, possibly one best addressed by neuroscientists.


In a way, this process mirrors the essence of machine learning, where systems learn from repeated exposure to data, gradually improving their ability to differentiate and recognize patterns without explicit instructions on the underlying characteristics.


How an image recognition system would have identified it as an apple


Just like in that example, we need to give an image recognition system lots of images of apples during training and tell it that each one is an apple or a banana (these are called labels). But what does the model under training actually see in those images, and what does it do with them?


Here is how the input looks to the system being trained for image recognition.


how the input looks to the system being trained for image recognition

Hmm, that doesn't quite resemble an apple. Even the colors displayed are mainly for human readability. In reality, what the training algorithm receives is a matrix of numbers representing the three-color channels: red, green, and blue (RGB). Each number within this matrix corresponds to a pixel, the fundamental component of any digital image. The color of each pixel is determined by a mix of RGB values, each typically ranging from 0 to 255 in an 8-bit color depth system. By altering the intensity of these primary colors, a broad array of colors can be generated.
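To make this concrete, here is a minimal sketch (using NumPy and the Pillow imaging library, with a hypothetical file named apple.jpg) of what an image looks like once it has been turned into numbers:

```python
import numpy as np
from PIL import Image  # Pillow; assumed here purely for illustration

# Load a photo of an apple (the file name is hypothetical) and convert it to an RGB array.
pixels = np.asarray(Image.open("apple.jpg").convert("RGB"))

print(pixels.shape)                # e.g. (224, 224, 3): rows x columns x the three RGB channels
print(pixels[0, 0])                # e.g. [203  45  61]: red, green, blue values of the top-left pixel
print(pixels.min(), pixels.max())  # every value lies between 0 and 255 in an 8-bit color depth system
```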


So, we feed the algorithm numerous images of apples (in the form of these numerical matrices) and label them as 'apples’. What does the training algorithm do with this information? It performs calculations to establish a standard—a yardstick, if you will—that enables it to distinguish between an apple and any other object.


Let’s see how it does it.


In machine learning lingo, the architecture most commonly used to train a machine to recognize images is called a Convolutional Neural Network (CNN).


High-level architecture of the convolutional neural network


convolutional neural network

  • Input consists of images known as training data, represented as matrices of numbers associated with each image.

  • Each circle represents a small computational unit (a neuron); its output, produced by an activation function, is called an activation.

  • These computational units are stacked, with the innermost ones called hidden layers, and the first and last ones called the input and output layers, respectively.

  • There are several different types of architectures, each with differences in the number of layers, their connections, and the activation functions they use.

  • Several image processing techniques are used. We'll focus on two important ones: convolution and pooling.

Convolution: This involves using a small matrix called a kernel to operate on the input matrix, which in this case is an image of an apple. Imagine a small window (the filter, or kernel) sliding over the entire image. At each position, the filter multiplies its values by the pixel values of the image currently beneath it, sums these results into a single number, and writes this number into a new image known as a feature map. This process highlights certain features in the image, such as edges, colors, or textures, depending on what the filter is designed to capture. Essentially, convolution helps the network focus on important spatial hierarchies in an image. A minimal code sketch of this operation follows the animation below.

Convolution in CNN

This animation helps visualize the convolution operation better:

visualize the convolution operation
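Here is a minimal sketch of that sliding-and-summing operation in plain NumPy. The tiny 6x6 "image" and the vertical-edge filter are made up purely for illustration; a real CNN learns its filter values during training and works on full RGB images.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, multiply element-wise, and sum into the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]           # the patch currently under the filter
            feature_map[i, j] = np.sum(window * kernel)  # one number per position
    return feature_map

# A tiny 6x6 single-channel "image": dark on the left, bright on the right.
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
])

# A classic vertical-edge filter: it responds strongly (here with large negative
# values) exactly where the dark region meets the bright one.
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

print(convolve2d(image, kernel))   # the feature map highlights the vertical edge
```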

Pooling: A CNN figures out features for small parts of an image and then combines them to understand progressively larger regions. Pooling reduces the size of the feature map extracted above: a small window (2x2, or another specified size) slides over the convolved matrix and replaces the numbers inside it with either their maximum or their average. In the figure below, the yellow box keeps sliding over the matrix and shrinks it by taking the max or average of all the numbers in the box. A small sketch of max pooling follows the figure.

Pooling
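And here is a similarly minimal sketch of max pooling with a 2x2 window (the numbers in the feature map are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Slide a non-overlapping size x size window and keep the largest value in each window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // size, w // size))
    for i in range(0, h - h % size, size):
        for j in range(0, w - w % size, size):
            pooled[i // size, j // size] = feature_map[i:i + size, j:j + size].max()
    return pooled

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 1, 4, 8],
])

print(max_pool(feature_map))
# [[6. 2.]
#  [7. 9.]]   -> a 4x4 map shrinks to 2x2, keeping the strongest response in each region
```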


  • As the model keeps learning on different sets of images, the convolution filters keep getting adjusted until some threshold is reached. Several variables control this process, but for the sake of simplicity, we can say that the convolutional filters keep getting adjusted.

  • When we talk about learning, we mean adjusting the weights (i.e., the filters) based on a comparison between the predicted outcomes and the actual outcomes (labels). This comparison is done using something called a loss function. Then a backward pass adjusts the weights; a minimal training-step sketch in code follows the figure below.

backward pass to adjust the weights
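To tie these pieces together, here is a hedged, minimal sketch of one training step using PyTorch. The tiny network, the random "images," and the labels are all stand-ins; the point is only to show the forward pass, the loss function, and the backward pass that adjusts the weights.

```python
import torch
import torch.nn as nn

# A deliberately tiny CNN just to show the mechanics; real architectures are much deeper.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),   # convolution: 3 RGB channels in, 8 filters out
    nn.MaxPool2d(2),                             # pooling: shrink the feature maps
    nn.Flatten(),
    nn.Linear(8 * 15 * 15, 3),                   # 3 logits: apple, banana, cherry
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                  # compares predictions with the true labels

images = torch.randn(4, 3, 32, 32)               # stand-in for a batch of labeled training images
labels = torch.tensor([0, 1, 0, 2])              # 0 = apple, 1 = banana, 2 = cherry

optimizer.zero_grad()                            # clear gradients from any previous step
logits = model(images)                           # forward pass: predictions from current filters
loss = loss_fn(logits, labels)                   # how far off were we?
loss.backward()                                  # backward pass: how should each weight change?
optimizer.step()                                 # nudge the filters (weights) in that direction
```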

  • The result of this series of convolution and pooling operations, once flattened and passed through the final layer, is a short list of numbers called logits.

  • The output layer converts these logits into probabilities by exponentiating and normalizing them, ensuring that the final numbers are non-negative, fall between 0 and 1, and sum to 1.

softmax formula

  • For example, if we’re training the model to identify apples, bananas, and cherries, the logits could look like [2.0, -1.0, 0.5], which the softmax function then transforms into probabilities of roughly [0.79, 0.04, 0.18].

  • Once the model is trained, we provide an image of a fruit, and it goes through the calculations to output the probabilities. We then set a condition: if the model predicts a probability of over 70% for the image being an apple, then we output 'apple'.
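Here is a small sketch of both steps: the softmax that turns logits into probabilities, and the 70% decision rule on top of it.

```python
import numpy as np

def softmax(logits):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j): exponentiate, then normalize so the results sum to 1."""
    exps = np.exp(logits - np.max(logits))   # subtracting the max is a standard numerical-stability trick
    return exps / exps.sum()

classes = ["apple", "banana", "cherry"]
logits = np.array([2.0, -1.0, 0.5])          # the raw scores from the example above

probs = softmax(logits)
print(dict(zip(classes, np.round(probs, 2))))  # roughly {'apple': 0.79, 'banana': 0.04, 'cherry': 0.18}

# The decision rule from the text: only report "apple" if the model is over 70% confident.
if probs[0] > 0.7:
    print("apple")
```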


Revisiting the comparison between how computers and humans recognize images: In computer vision, the yardstick that enables a system to discern an apple from other fruits is a mathematical model. This model interprets an image—essentially a grid of numerical values—and processes it through a series of calculations. The outcome is a set of probabilities. If the calculated probability for "apple" falls within a predefined range, the model concludes there’s a certain percentage chance the image is, indeed, an apple.


But what does the model learn during its training? It picks up on distinctive apple features from the training images, such as shape, edges, and color patterns. Humans, while not explicitly performing these mathematical calculations, have some abstract representation of the objects and use that to distinguish them from others.


An Illustration


I gave ChatGPT an unusual image that partly looked like an apple and asked it what it saw. As expected, it gave a proper description of the image based on its knowledge of other objects.


ChatGPT description of an image with an apple-like appearance

Localization, Object Detection, and Object Tracking


In real-world use cases, computer vision systems not only need to identify what object is in an image but also specify 'where' the object is located. This is achieved by delineating the recognized object with a bounding box within the image, enhancing our understanding with spatial context.


In practical scenarios, especially those involving multiple objects of various classes, complexity increases. Take a self-driving car, for example, where the onboard computer vision system is tasked with recognizing and localizing a plethora of objects like pedestrians, other vehicles, and motorcycles, against the backdrop of an ever-changing environment. To manage this, the model doesn't just classify these objects into distinct categories using the softmax activation function in the output layer; it goes a step further.


Localization comes into play here, making the model draw virtual boxes around each identified object, which requires generating additional output values. Specifically, for each detected object, the model computes four more values. These values define the parameters of the bounding box: the coordinates of the box's center, its width, and its height. Thus, the model learns not just to recognize an object but also to understand its precise location within the visual space of the image, a critical capability for systems that interact with the real world.
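As a rough illustration (with made-up numbers), the output for one detected object could be interpreted like this:

```python
import numpy as np

# Hypothetical output for ONE detected object in one image:
# three class probabilities (pedestrian, car, motorcycle) plus four bounding-box parameters.
classes = ["pedestrian", "car", "motorcycle"]
class_probs = np.array([0.05, 0.92, 0.03])       # after softmax, as in plain classification
box = {"x_center": 0.61, "y_center": 0.48,       # box center, as fractions of image width/height
       "width": 0.20, "height": 0.15}            # box size, in the same fractional units

label = classes[int(np.argmax(class_probs))]
print(label, box)   # car {'x_center': 0.61, ...} -> what the object is AND where it is
```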


Localization, Object Detection, and Object Tracking

  • It involves using techniques like the sliding window. Take the example of car detection:

    • Create a labeled training set of carefully cropped images that only contain a car.

    • Use that training data to train a convolutional neural network (conv net) to detect cars.

    • Slide a small window over the original image and give each crop to the network to check whether it contains a car (a minimal sketch of this loop follows the list).

  • There are several other techniques that we can’t cover in this post, such as intersection over union, Anchor boxes, etc.
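Here is a minimal, hypothetical sketch of the sliding-window loop described above. The `classify` function stands in for the conv net trained on cropped car images; everything else is just bookkeeping for where the window currently is.

```python
import numpy as np

def sliding_window_detect(image, window_size, stride, classify):
    """Slide a square window over the image and run a trained classifier on every crop."""
    height, width = image.shape[:2]
    detections = []
    for top in range(0, height - window_size + 1, stride):
        for left in range(0, width - window_size + 1, stride):
            crop = image[top:top + window_size, left:left + window_size]
            if classify(crop) == "car":                               # the conv net's verdict for this crop
                detections.append((left, top, window_size, window_size))  # where a car was found
    return detections

# Toy usage with a stand-in classifier (a real system would call the trained conv net here).
dummy_image = np.zeros((128, 128, 3), dtype=np.uint8)
boxes = sliding_window_detect(dummy_image, window_size=64, stride=32, classify=lambda crop: "no car")
print(boxes)   # [] -> the stand-in classifier never reports a car
```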


Object Tracking: Another important use case in this context is object tracking, which essentially means detecting objects as they move.


Conclusion


In closing, we've just begun to uncover the essentials of computer vision, exploring how it discerns and pinpoints objects through a digital lens. Think of self-driving cars navigating bustling streets or smartphones recognizing faces — these are real-world applications harnessing the power we've discussed. As we move forward, look forward to our deep dive into the realm of Large Language Models (LLMs), revealing another facet of AI's capabilities. More to come soon—stay connected!
