Text representation: Bag-of-words (BoW) - Python Tutorial
From the course: Natural Language Processing for Speech and Text: From Beginner to Advanced
- [Instructor] In previous videos, we learned about one-hot encoding and n-grams for text representation. If you're already thinking, what about full documents? You are right. This is where bag-of-words, or BoW, comes in. Bag-of-words represents text data by considering the frequency of tokens in a document. So, in a corpus, which is a collection of documents, each document is represented as a vector of word counts, with each dimension representing a specific word from the vocabulary. With bag-of-words, the focus is on token counts; word order and grammatical structure are disregarded. Consider these three sentences: "Natural language processing for speech and text." "Language processing for speech and text." "Text and speech for natural language processing." The first and the third have exactly the same counts: one instance each of the words natural, language, processing, for, speech, and, and text. Even though they contain the same elements in the same counts, the order has changed their…
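As a minimal sketch of this idea in Python, assuming scikit-learn is installed, the snippet below builds bag-of-words vectors for the three sentences above using CountVectorizer, the standard scikit-learn vectorizer for word counts (the exact code in the follow-up demo video may differ):

from sklearn.feature_extraction.text import CountVectorizer

# The three example sentences from this video.
corpus = [
    "Natural language processing for speech and text.",
    "Language processing for speech and text.",
    "Text and speech for natural language processing.",
]

# Build the vocabulary and count token frequencies per document;
# word order and grammar are discarded.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['and' 'for' 'language' 'natural' 'processing' 'speech' 'text']
print(bow.toarray())
# [[1 1 1 1 1 1 1]
#  [1 1 1 0 1 1 1]
#  [1 1 1 1 1 1 1]]

Note that the first and third rows are identical: bag-of-words cannot tell those two sentences apart, because only counts survive, not order.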
Contents
- Text preprocessing (3m 6s)
- Text preprocessing using NLTK (7m 10s)
- Text representation (2m 18s)
- Text representation: One-hot encoding (2m 6s)
- One-hot encoding using scikit-learn (3m 32s)
- Text representation: N-grams (2m 21s)
- N-grams representation using NLTK (3m 3s)
- Text representation: Bag-of-words (BoW) (2m 1s)
- Bag-of-words representation using scikit-learn (2m 29s)
- Text representation: Term frequency-inverse document frequency (TF-IDF) (1m 50s)
- TF-IDF representation using scikit-learn (2m 8s)
- Text representation: Word embeddings (2m 56s)
- Word2vec embedding using Gensim (9m 8s)
- Embedding with pretrained spaCy model (5m 7s)
- Sentence embedding using the Sentence Transformers library (3m 42s)
- Text representation: Pre-trained language models (PLMs) (2m 34s)
- Pre-trained language models using Transformers (5m 43s)