From the course: Advanced RAG Applications with Vector Databases
Vision models 101
- [Instructor] In order to understand how machines compare images, we're going to do a crash course on how vision models work. Remember that vision models are just deep neural networks trained for computer vision tasks such as image classification, segmentation, or object detection. Let's go back in time a bit. Back in the 1960s, we got our first neural networks. These first neural networks were simple: layers of neurons in which each layer was fully connected to the layers before and after it. As machine learning progressed, we learned that modeling neural networks in different ways gave better results for different types of data. When it comes to vision data, we found that a technique called convolution was helpful for capturing local context and helping machines decipher images. In 1993, the first paper to use max pooling, a way to aggregate the output of convolution layers, was published. This combination of convolutions and pooling was the most common type of vision model for nearly 30 years. Then in 2020, three years after the original transformers paper, which showed how to use transformer models for language, a new paper on vision transformers came out. Vision transformers took the attention mechanism introduced in the original transformers paper and applied it to computer vision.

So what is a convolutional neural network? The defining architecture of a convolutional neural network is the combination of a convolutional layer and a pooling layer. These two layers help us get context from different places in the image and combine all these local contexts to make sense of the image. Let's take a look at what this looks like visually. Imagine you have a 2D image filled with numbers like the one shown here. This picture shows how a convolution might work. In this case, we are looking at a three-by-three convolution. Each convolution has a filter, and this filter is learnable, meaning it will change depending on how you train the model. The top part of the image shows the result of a convolution, and the bottom part shows how it's done. You take each entry in the convolution window and multiply it element-wise with the corresponding value in the filter. Then you add up all of the values in the resulting square and use that sum as the value for the square the convolution is centered on in the output image. So that's a convolution, and this is pooling. Max pooling is a bit less complicated than a convolution. As shown in the image, all you need to do is take the maximum value in a pool and use it to represent the entire region. In the red region, we have 12, 20, 8, and 12, so we take 20 as the value to represent that region. (A short code sketch of both operations follows this transcript.)

Today's vision model zeitgeist is the vision transformer. Derived from the classic transformer model, which applies an encoder, a decoder, and an attention mechanism to an input sequence, vision transformers take the inspiration drawn from language and apply it to computer vision. Much like the idea of convolutional filters, vision transformers operate on patches. These patches are N-by-N squares that each make up a piece of the image. Each of these patches is turned into an embedding, as we talked about earlier in the course. Then these embeddings are fed into the encoder and mixed with the attention mechanism as they are passed to the decoder. Here we see an illustration of how vision transformers work. In the bottom left, we see an image split into nine patches.
We take these nine patches and turn them into patch embeddings. These patch embeddings, often combined with a class token, denoted CLS in the diagram, are then combined with a positional encoding and fed into the transformer. The output is then fed into a multilayer perceptron, also known as a fully connected neural network, denoted as the MLP head in the image, and out come logits that describe the image, typically for some object detection or segmentation task.
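To make the convolution and max-pooling steps from the transcript concrete, here is a minimal NumPy sketch. It is not code from the course: the function names, the 6x6 toy image, and the random 3x3 filter are illustrative stand-ins for what a trained convolutional layer would learn.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) 2D convolution: slide the filter over the image,
    multiply each window element-wise by the filter, and sum it into one value."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: each size-by-size region is
    represented by its maximum value."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(trimmed.shape[0] // size, size,
                             trimmed.shape[1] // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image" of numbers
kernel = np.random.randn(3, 3)                    # the filter; learned during training
features = convolve2d(image, kernel)              # 4x4 feature map
pooled = max_pool2d(features, size=2)             # 2x2 map after max pooling
print(features.shape, pooled.shape)               # (4, 4) (2, 2)
```

In a real convolutional network, the filter values are the parameters that training adjusts, and many filters run in parallel to produce a stack of feature maps.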
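Similarly, here is a rough sketch of the patch-embedding flow just described, again as an illustration rather than the course's code: the 48x48 image, the 16x16 patch size, the embedding dimension, and the random projection, CLS token, and positional values are all assumptions standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((48, 48, 3))   # toy RGB image
patch = 16                        # 16x16 patches -> a 3x3 grid of nine patches
dim = 64                          # embedding dimension (illustrative)

# Split the image into nine flattened patches of length 16 * 16 * 3.
patches = (image.reshape(3, patch, 3, patch, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(9, patch * patch * 3))

# A linear projection turns each flattened patch into a patch embedding.
# Random weights stand in for the learned projection here.
projection = rng.normal(size=(patch * patch * 3, dim))
patch_embeddings = patches @ projection            # shape (9, dim)

# Prepend the CLS token and add positional encodings
# (both are learned parameters in a real model).
cls_token = rng.normal(size=(1, dim))
positional = rng.normal(size=(10, dim))
tokens = np.concatenate([cls_token, patch_embeddings]) + positional

print(tokens.shape)   # (10, 64): ten tokens ready for the transformer
```

In an actual vision transformer, the projection, the CLS token, and the positional encodings are all learned, and the transformer's attention mixes information across these ten tokens before the MLP head produces the logits.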