Bag of Words (BoW) and TF-IDF in Natural Language Processing (NLP)
Introduction
In Natural Language Processing (NLP), text data needs to be converted into a numerical format before being used for machine learning models. Two fundamental text vectorization techniques used for this purpose are:
- Bag of Words (BoW) – Represents text based on word frequency.
- Term Frequency-Inverse Document Frequency (TF-IDF) – Represents text based on word importance.
These methods help in text classification, sentiment analysis, document retrieval, and other NLP tasks.
1. Bag of Words (BoW)
What is Bag of Words?
The Bag of Words (BoW) model represents text data by counting the occurrences of each word in a document while ignoring grammar and word order.
Key Features:
- Converts text into numerical form.
- Considers word frequency.
- Ignores word meaning and order.
Steps in Bag of Words
Step 1: Text Preprocessing
- Convert text to lowercase.
- Remove punctuation, stopwords, and special characters.
- Tokenize the text into words.
Example Text:
"The cat sat on the mat. The dog barked at the cat."
Unique tokens: ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'barked', 'at'] (stopword removal is skipped in this small example so that every word appears in the counts below; a minimal preprocessing sketch follows).
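A minimal preprocessing sketch using only the Python standard library (the regex-based punctuation stripping is one illustrative choice among many; libraries such as NLTK or spaCy offer more robust tokenizers):

```python
import re

def preprocess(text):
    # Lowercase, strip punctuation/special characters, and split on whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

tokens = preprocess("The cat sat on the mat. The dog barked at the cat.")
print(tokens)
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog', 'barked', 'at', 'the', 'cat']
```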
Step 2: Creating a Vocabulary
A vocabulary is created from all unique words in the dataset.
| Word | Index |
|---|---|
| the | 0 |
| cat | 1 |
| sat | 2 |
| on | 3 |
| mat | 4 |
| dog | 5 |
| barked | 6 |
| at | 7 |
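Continuing the sketch, the vocabulary can be built by deduplicating the token list from Step 1 while preserving first-seen order (the `tokens` list is repeated here so the snippet runs on its own):

```python
# Tokens produced in Step 1.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat',
          'the', 'dog', 'barked', 'at', 'the', 'cat']

# dict.fromkeys deduplicates while keeping first-seen order.
vocabulary = {word: i for i, word in enumerate(dict.fromkeys(tokens))}
print(vocabulary)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'barked': 6, 'at': 7}
```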
Step 3: Constructing the Word Count Vector
Each sentence is represented as a vector where each element corresponds to a word's frequency.

| Sentence | the | cat | sat | on | mat | dog | barked | at |
|---|---|---|---|---|---|---|---|---|
| "The cat sat on the mat" | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| "The dog barked at the cat" | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |

Each row is a vector representation of a sentence.
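The same vectors can be produced by hand with the vocabulary from Step 2; this sketch simply increments one slot per token:

```python
vocabulary = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3,
              'mat': 4, 'dog': 5, 'barked': 6, 'at': 7}

def count_vector(tokens, vocabulary):
    # One slot per vocabulary word; add 1 to a word's slot for each occurrence.
    vector = [0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary[token]] += 1
    return vector

print(count_vector(['the', 'cat', 'sat', 'on', 'the', 'mat'], vocabulary))
# [2, 1, 1, 1, 1, 0, 0, 0]
print(count_vector(['the', 'dog', 'barked', 'at', 'the', 'cat'], vocabulary))
# [2, 1, 0, 0, 0, 1, 1, 1]
```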
Step 4: Implementing Bag of Words in Python
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and count each word

print(vectorizer.get_feature_names_out())  # display the unique words
print(X.toarray())                         # the numerical (count) representation
```
Output:

```
['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0 0 1 0 1 1 1 2]
 [1 1 1 1 0 0 0 2]]
```

Each row represents a sentence, and each column a word's count. Note that CountVectorizer sorts its vocabulary alphabetically, which is why the column order differs from the hand-built table above.
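A fitted vectorizer can also encode unseen text against the learned vocabulary; continuing from the snippet above (the example sentence is made up, and "hissed" is out of vocabulary):

```python
# 'hissed' is not in the learned vocabulary, so it is silently ignored.
new_doc = vectorizer.transform(["The cat hissed at the dog."])
print(new_doc.toarray())  # [[1 0 1 1 0 0 0 2]]
```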
Advantages of Bag of Words
- Simple and easy to implement.
- Works well for text classification tasks.
- Effective for small datasets.
Disadvantages of Bag of Words
- Ignores word meaning and context.
- Vocabulary size increases with dataset size.
- Treats all words equally, ignoring importance.
2. Term Frequency – Inverse Document Frequency (TF-IDF)
What is TF-IDF?
TF-IDF (Term Frequency – Inverse Document Frequency) is an improved version of BoW that assigns weights to words based on their importance in a document relative to a collection of documents (corpus).
Key Features:
- Highlights important words in a document.
- Reduces the impact of frequently occurring but unimportant words like "the", "is", "and".
- Normalizes text data for better performance in ML models.
Step 1: Compute Term Frequency (TF)
Term Frequency (TF) measures how frequently a word appears in a document:

$$TF = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}$$
Example:
"The cat sat on the mat. The dog barked at the cat."
TF for the word "cat" in the first sentence, which has 6 words and contains "cat" once:

$$TF_{\text{cat}} = \frac{1}{6} \approx 0.167$$
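A direct translation of the TF formula into Python (a sketch of the definition above, not how scikit-learn computes it):

```python
def term_frequency(word, tokens):
    # TF = occurrences of the word / total number of tokens in the document.
    return tokens.count(word) / len(tokens)

sentence1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(round(term_frequency('cat', sentence1), 3))  # 0.167
```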
Step 2: Compute Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents:

$$IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)$$

(The examples below use the base-10 logarithm.)
Example: "cat" appears in 2 out of 2 documents, so:

$$IDF_{\text{cat}} = \log_{10}\left(\frac{2}{2}\right) = 0$$

"barked" appears in 1 out of 2 documents, so:

$$IDF_{\text{barked}} = \log_{10}\left(\frac{2}{1}\right) \approx 0.301$$
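The IDF formula in Python, again as a minimal sketch (it assumes the word occurs in at least one document, otherwise the division fails):

```python
import math

def inverse_document_frequency(word, documents):
    # IDF = log10(total documents / documents containing the word).
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / containing)

docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'dog', 'barked', 'at', 'the', 'cat']]
print(round(inverse_document_frequency('cat', docs), 3))     # 0.0
print(round(inverse_document_frequency('barked', docs), 3))  # 0.301
```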
Step 3: Compute TF-IDF Score
$$TF\text{-}IDF = TF \times IDF$$

- Words that appear in many documents get lower weights.
- Words that appear in few documents get higher weights.
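Putting Steps 1 and 2 together for the running example: the word "barked" in the second sentence scores $0.167 \times 0.301 \approx 0.050$, while "cat" scores $0.167 \times 0 = 0$ because it appears in every document.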
Step 4: Implementing TF-IDF in Python
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # learn vocabulary and IDF weights, then transform

print(vectorizer.get_feature_names_out())  # display the words
print(X.toarray())                         # the TF-IDF representation
```
Output (rounded to three decimals):

```
['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0.    0.    0.303 0.    0.425 0.425 0.425 0.605]
 [0.425 0.425 0.303 0.425 0.    0.    0.    0.605]]
```

These values differ from the hand calculation above because TfidfVectorizer uses a smoothed natural-log IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each row, rather than the plain log10(N / df) formula. Within a document, higher weights indicate more distinctive words, although on a two-sentence corpus the effect is muted ("the" still scores highest here simply because it occurs twice in each sentence).
Advantages of TF-IDF
- Reduces the impact of common words.
- Highlights important words.
- Works well for text retrieval tasks.
Disadvantages of TF-IDF
- Doesn't capture word order or meaning.
- Computationally expensive for large datasets.
3. Comparison: BoW vs. TF-IDF
| Feature | Bag of Words (BoW) | TF-IDF |
|---|---|---|
| Word Frequency | Counts word occurrences | Adjusts weight using IDF |
| Handles Common Words | Treats all words equally | Lowers importance of frequent words |
| Context Awareness | No | No |
| Computational Cost | Low | Higher |
| Use Case | Text classification | Information retrieval, search engines |
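To see the "Handles Common Words" row in action, here is a quick sketch on a slightly larger corpus (four sentences invented for illustration), where the ubiquitous word "the" receives the lowest IDF weight:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented sentences; 'the' appears in every one.
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The bird sang in the tree.",
    "The cat chased the bird.",
]

tfidf = TfidfVectorizer().fit(corpus)

# Pair each vocabulary word with its learned IDF weight and show the lowest ones.
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
print(sorted(idf.items(), key=lambda kv: kv[1])[:3])
# 'the' comes first with the smallest IDF; BoW would count it like any other word.
```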
4. Applications of BoW and TF-IDF
- Spam Filtering – Identify spam emails using keyword importance.
- Sentiment Analysis – Classify positive/negative reviews.
- Search Engines – Rank web pages based on relevance.
- Chatbots – Understand user queries using vectorized text.