From the course: TensorFlow: Working with NLP


How was BERT trained?

- [Instructor] So what were BERT and GPT-2 trained on? Any bias and prejudice in these data sources will make their way into the models. BERT was trained on the English Wikipedia, which has around two and a half billion words, and on something known as the BookCorpus, which is around 800 million words. The BookCorpus consists of 11,000 books written by as-yet-unpublished authors. GPT-2 was trained on the WebText corpus. The researchers at OpenAI created WebText by scraping all outbound links from Reddit, a social media platform, that received at least three karma points. They did this because karma was an indicator of whether other users found the link interesting, educational, or just funny. So the WebText corpus contains the text subset of these 45 million links and doesn't include links created after December 2017. Now, if you read the BERT paper, the two key contributions of BERT are these two tasks, Masked…
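The sketch below is not from the course; it is a minimal, hedged illustration of querying a pre-trained BERT model from TensorFlow using the Hugging Face transformers fill-mask pipeline. The model name "bert-base-uncased" and the example sentence are assumptions chosen for illustration.

```python
# A minimal sketch, assuming the Hugging Face transformers library with the
# TensorFlow backend is installed. It asks a pre-trained BERT model to fill
# in a masked token. Model name and sentence are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased", framework="tf")

# BERT's tokenizer uses the literal string [MASK] as its mask token.
for prediction in fill_mask("BERT was trained on the English [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```

Running this prints the model's top candidate tokens for the masked position with their scores, which gives a quick feel for what the pre-training data described above taught the model.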
