From the course: Advanced RAG Applications with Vector Databases
Vision models 101
- [Instructor] In order to understand how machines compare images, we're going to do a crash course on how vision models work. Remember that vision models are just deep neural networks trained for computer vision tasks such as image classification, segmentation, or object detection. Let's go back in time a bit. Back in the 1960s, we got our first neural networks. These first neural networks were simple: layers of neurons in which each layer was fully connected to the layers before and after it. As machine learning progressed, we learned that modeling neural networks in different ways gave better results for different types of data. When it comes to vision data, we found that a technique called convolution was helpful for capturing local context and helping machines decipher images. In 1993, the first paper to use max pooling, a way to aggregate the output of convolution layers, was published. This combination of convolutions and pooling was the most common type of vision model for nearly 30 years. Then in 2020, three years after the original transformers paper, which showed how to use transformer models for language, a new paper on vision transformers came out. Vision transformers took the attention mechanism introduced in the original transformers paper and applied it to computer vision.

So what is a convolutional neural network? The defining architecture of a convolutional neural network is the combination of a convolutional layer and a pooling layer. These two layers help us get context from different places in the image and combine all these local contexts to make sense of the image. Let's take a look at what this looks like visually. Imagine you have a 2D image filled with numbers like the one shown here. This picture shows how a convolution might work. In this case, we are looking at a three-by-three convolution. Each convolution has a filter, and this filter is learnable, meaning it will change depending on how you train the model. The top part of the image shows the result of a convolution, and the bottom part shows how it's done. You take each entry in the convolution window and multiply it element-wise with the corresponding value in the filter. Then you add up all of the values in the resulting square and use that sum as the value for the square the convolution is centered on in the output image. So that's a convolution, and this is pooling. Max pooling is a bit less complicated than a convolution. As shown in the image, all you need to do is take the maximum value in a pool and use it to represent the entire region. In the red region, we have 12, 20, 8, and 12, so we take 20 as the value to represent that region. (A short code sketch of both operations follows this transcript.)

Today's vision model zeitgeist is the vision transformer. Derived from the classic transformer model, which applies an encoder, a decoder, and an attention mechanism to an input sequence, vision transformers take the inspiration drawn from language and apply it to computer vision. Much like the idea of convolutional filters, vision transformers operate on patches. These patches are N-by-N squares that each make up a piece of the image. Each of these patches is turned into an embedding, as we talked about earlier in the course. Then these embeddings are fed into the encoder and mixed with the attention mechanism as they are passed to the decoder. Here we see an illustration of how vision transformers work. In the bottom left, we see an image split into nine patches.
We take these nine patches and turn them into patch embeddings. These patch embeddings, often combined with a class token, denoted CLS in the diagram, are then combined with a positional encoding and fed into the transformer. The output is then fed into a multilayer perceptron, also known as a fully connected neural network, denoted as the MLP head in the image, and out come logits that describe the image, typically for some object detection or segmentation task.
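To make the convolution and max-pooling steps from the transcript concrete, here is a minimal NumPy sketch. It is not code from the course: the function names, the 6x6 toy image, and the random 3x3 filter are illustrative stand-ins for what a trained convolutional layer would learn.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) 2D convolution: slide the filter over the image,
    multiply each window element-wise by the filter, and sum it into one value."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: each size-by-size region is
    represented by its maximum value."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(trimmed.shape[0] // size, size,
                             trimmed.shape[1] // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image" of numbers
kernel = np.random.randn(3, 3)                    # the filter; learned during training
features = convolve2d(image, kernel)              # 4x4 feature map
pooled = max_pool2d(features, size=2)             # 2x2 map after max pooling
print(features.shape, pooled.shape)               # (4, 4) (2, 2)
```

In a real convolutional network, the filter values are the parameters that training adjusts, and many filters run in parallel to produce a stack of feature maps.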
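Similarly, here is a rough sketch of the patch-embedding flow just described, again as an illustration rather than the course's code: the 48x48 image, the 16x16 patch size, the embedding dimension, and the random projection, CLS token, and positional values are all assumptions standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((48, 48, 3))   # toy RGB image
patch = 16                        # 16x16 patches -> a 3x3 grid of nine patches
dim = 64                          # embedding dimension (illustrative)

# Split the image into nine flattened patches of length 16 * 16 * 3.
patches = (image.reshape(3, patch, 3, patch, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(9, patch * patch * 3))

# A linear projection turns each flattened patch into a patch embedding.
# Random weights stand in for the learned projection here.
projection = rng.normal(size=(patch * patch * 3, dim))
patch_embeddings = patches @ projection            # shape (9, dim)

# Prepend the CLS token and add positional encodings
# (both are learned parameters in a real model).
cls_token = rng.normal(size=(1, dim))
positional = rng.normal(size=(10, dim))
tokens = np.concatenate([cls_token, patch_embeddings]) + positional

print(tokens.shape)   # (10, 64): ten tokens ready for the transformer
```

In an actual vision transformer, the projection, the CLS token, and the positional encodings are all learned, and the transformer's attention mixes information across these ten tokens before the MLP head produces the logits.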