From the course: Introduction to Attention-Based Neural Networks

Language generation models

- [Instructor] In this video, we'll get a quick overview of how sequence-to-sequence models work for language generation. The principles and structure that we'll discuss for the recurrent neural network here are general principles for language generation models. We'll specifically discuss language translation models in the next video. Now, language generation models. The input is a sequence of words. The output is a sequence of words. The word that is generated at some time instance t-1 is fed as an input to get the next word in the sequence at time t. You'll find that this is a pretty standard setup for any language generation model. Because sentences are sequences, you can use the word at the current time instance to predict the word at the next time instance.

Here is an architectural overview of what a language generation model might look like. Now, this is an unrolled RNN where we are feeding in different words as input at different time instances. Neural networks only understand numeric input, which means your input words have to be converted to numeric form. The most common representation for input words is one-hot encoding, so the words input to the RNN at different time instances are one-hot vectors. The dimensionality, or size, of this one-hot vector depends, of course, on the size of your vocabulary. It's common practice in an RNN to have an additional embedding layer that accepts one-hot vector representations of words and converts each word to an embedding, a lower-dimensionality representation. The weights of the embedding layer are also trained during the training of the RNN. The individual words in the sequence are input to the RNN cell, which is then unrolled through time. The number of unrolled layers in the RNN is equal to the number of elements in the input sequence. So if the input sequence has 20 words, that is, your sentence is 20 words long, your unrolled RNN will have 20 layers. Typically, the first word in the input sequence is used to get a prediction that is the second word of the sequence. Next, we have the second word in the sequence used to predict the third word of the sequence. Observe how the hidden state of the RNN is fed through to the next layer. Because of the sequential nature of sentences and the feeding of this hidden state from one layer to the other, the output at each time step depends on all of the words that have been seen so far in the input sequence.

What the RNN outputs at every time step, that is, any Y that you see here in this diagram, is actually a probability distribution over all of the words in the output vocabulary. This probability distribution gives us the likelihood of what the next predicted word might be. Each input that you feed in at a time step produces an output probability distribution over all possible words from the vocabulary: of all of the words in the vocabulary, what is the most likely word at this time step? This probability distribution depends on all of the words in the sequence seen so far. So the probability of a particular word at the output depends on all of the words seen so far in the sequence: given all of the words seen so far, what is the most likely next word in the vocabulary? Here, V of i is the ith word in the vocabulary, and the model gives us the probability that it is the word generated at time instance t.
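Just to make that architecture concrete, here is a minimal sketch of such a model. It assumes PyTorch as the framework and made-up sizes for the vocabulary, embedding, and hidden state; the course does not prescribe any of these.

```python
# Minimal sketch of the language generation RNN described above.
# PyTorch and all layer sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class LanguageGenerationRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Embedding layer: takes a word index (equivalent to a one-hot
        # vector over the vocabulary) and maps it to a lower-dimensionality
        # dense representation. Its weights are trained along with the RNN.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The RNN cell that gets unrolled through time, one step per word.
        # The hidden state it returns is what flows from one step to the next.
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Projects each hidden state to a score per vocabulary word;
        # softmax over these scores gives the probability distribution
        # over the next word.
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, hidden=None):
        # word_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(word_ids)           # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(embedded, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.to_vocab(outputs)               # (batch, seq_len, vocab_size)
        return logits, hidden

# At time step t, softmax(logits[:, t]) is the likelihood of every word in
# the vocabulary being the next word, given all of the words seen so far.
```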
So how do you train a language generation model? At every step, the output is a probability distribution over the words in the vocabulary. You compare the word generated by the model with the actual next word in the sequence; that is part of your training data. You typically use a loss function such as the cross-entropy loss function, which allows us to compute the divergence between the distribution predicted by our model and the actual next word from the training data. So at any point in time, we are trying to predict the next word in the sequence.

In order to generate a sequence using the model, let's say we've seen three words so far, and then we have the output here at time instance four. We know that the output at every time step is a probability distribution over all possible words in the vocabulary. We look at the probability distribution and pick the word that is most likely at this particular time step t. Now that we have a word, let's say that's Y4, it is then fed back to the next layer of the RNN as an input and used to generate the next word in the sequence. And this continues until we generate all of the words in the sentence.
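Again as a sketch only, training and generation with the model above might look like the following. It reuses the hypothetical LanguageGenerationRNN class from the earlier sketch, and the optimizer choice and the greedy, pick-the-most-likely-word decoding are illustrative assumptions.

```python
# Sketch of training with cross-entropy and generating word by word.
# Assumes the LanguageGenerationRNN class from the earlier sketch.
import torch
import torch.nn as nn

model = LanguageGenerationRNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(sentence_ids):
    # sentence_ids: (batch, seq_len) word indices from the training data.
    # Inputs are words 0..n-2; targets are the actual next words 1..n-1.
    inputs, targets = sentence_ids[:, :-1], sentence_ids[:, 1:]
    logits, _ = model(inputs)
    # Cross-entropy measures the divergence between the predicted
    # distribution at each step and the actual next word.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def generate(seed_ids, max_new_words=10):
    # seed_ids: (1, seed_len) words seen so far. Pick the most likely next
    # word from the output distribution, feed it back in as the next input,
    # and repeat until we've generated the requested number of words.
    words = seed_ids
    logits, hidden = model(words)
    for _ in range(max_new_words):
        next_word = logits[:, -1].argmax(dim=-1, keepdim=True)  # (1, 1)
        words = torch.cat([words, next_word], dim=1)
        logits, hidden = model(next_word, hidden)
    return words
```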
