Bag of Words (BoW) and TF-IDF in Natural Language Processing (NLP)
Introduction
In Natural Language Processing (NLP), text data needs to be converted into a numerical format before being used for machine learning models. Two fundamental text vectorization techniques used for this purpose are:
- Bag of Words (BoW) – Represents text based on word frequency.
- Term Frequency-Inverse Document Frequency (TF-IDF) – Represents text based on word importance.
These methods help in text classification, sentiment analysis, document retrieval, and other NLP tasks.
1. Bag of Words (BoW)
What is Bag of Words?
The Bag of Words (BoW) model represents text data by counting the occurrences of each word in a document while ignoring grammar and word order.
Key Features:
- Converts text into numerical form.
- Considers word frequency.
- Ignores word meaning and order.
Steps in Bag of Words
Step 1: Text Preprocessing
- Convert text to lowercase.
- Remove punctuation, stopwords, and special characters.
- Tokenize the text into words.
Example Text:
"The cat sat on the mat. The dog barked at the cat."
Unique tokens: ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'barked', 'at'] (stopword removal is skipped in this small example so that every word appears in the counts below; a minimal preprocessing sketch follows).
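A minimal preprocessing sketch using only the Python standard library (the regex-based punctuation stripping is one illustrative choice among many; libraries such as NLTK or spaCy offer more robust tokenizers):

```python
import re

def preprocess(text):
    # Lowercase, strip punctuation/special characters, and split on whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

tokens = preprocess("The cat sat on the mat. The dog barked at the cat.")
print(tokens)
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog', 'barked', 'at', 'the', 'cat']
```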
Step 2: Creating a Vocabulary
A vocabulary is created from all unique words in the dataset.
| Word | Index |
|---|---|
| the | 0 |
| cat | 1 |
| sat | 2 |
| on | 3 |
| mat | 4 |
| dog | 5 |
| barked | 6 |
| at | 7 |
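Continuing the sketch, the vocabulary can be built by deduplicating the token list from Step 1 while preserving first-seen order (the `tokens` list is repeated here so the snippet runs on its own):

```python
# Tokens produced in Step 1.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat',
          'the', 'dog', 'barked', 'at', 'the', 'cat']

# dict.fromkeys deduplicates while keeping first-seen order.
vocabulary = {word: i for i, word in enumerate(dict.fromkeys(tokens))}
print(vocabulary)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'barked': 6, 'at': 7}
```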
Step 3: Constructing the Word Count Vector
Each sentence is represented as a vector where each element corresponds to a word's frequency.

| Sentence | the | cat | sat | on | mat | dog | barked | at |
|---|---|---|---|---|---|---|---|---|
| "The cat sat on the mat" | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| "The dog barked at the cat" | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |

Each row is a vector representation of a sentence.
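The same vectors can be produced by hand with the vocabulary from Step 2; this sketch simply increments one slot per token:

```python
vocabulary = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3,
              'mat': 4, 'dog': 5, 'barked': 6, 'at': 7}

def count_vector(tokens, vocabulary):
    # One slot per vocabulary word; add 1 to a word's slot for each occurrence.
    vector = [0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary[token]] += 1
    return vector

print(count_vector(['the', 'cat', 'sat', 'on', 'the', 'mat'], vocabulary))
# [2, 1, 1, 1, 1, 0, 0, 0]
print(count_vector(['the', 'dog', 'barked', 'at', 'the', 'cat'], vocabulary))
# [2, 1, 0, 0, 0, 1, 1, 1]
```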
Step 4: Implementing Bag of Words in Python
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and count each word

print(vectorizer.get_feature_names_out())  # display the unique words
print(X.toarray())                         # the numerical (count) representation
```
Output:

```
['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0 0 1 0 1 1 1 2]
 [1 1 1 1 0 0 0 2]]
```

Each row represents a sentence, and each column a word's count. Note that CountVectorizer sorts its vocabulary alphabetically, which is why the column order differs from the hand-built table above.
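A fitted vectorizer can also encode unseen text against the learned vocabulary; continuing from the snippet above (the example sentence is made up, and "hissed" is out of vocabulary):

```python
# 'hissed' is not in the learned vocabulary, so it is silently ignored.
new_doc = vectorizer.transform(["The cat hissed at the dog."])
print(new_doc.toarray())  # [[1 0 1 1 0 0 0 2]]
```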
Advantages of Bag of Words
- Simple and easy to implement.
- Works well for text classification tasks.
- Effective for small datasets.
Disadvantages of Bag of Words
- Ignores word meaning and context.
- Vocabulary size increases with dataset size.
- Treats all words equally, ignoring importance.
2. Term Frequency – Inverse Document Frequency (TF-IDF)
What is TF-IDF?
TF-IDF (Term Frequency – Inverse Document Frequency) is an improved version of BoW that assigns weights to words based on their importance in a document relative to a collection of documents (corpus).
Key Features:
- Highlights important words in a document.
- Reduces the impact of frequently occurring but unimportant words like "the", "is", "and".
- Normalizes text data for better performance in ML models.
Step 1: Compute Term Frequency (TF)
Term Frequency (TF) measures how frequently a word appears in a document:

$$TF = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}$$
Example:
"The cat sat on the mat. The dog barked at the cat."
TF for the word "cat" in the first sentence, which has 6 words and contains "cat" once:

$$TF_{\text{cat}} = \frac{1}{6} \approx 0.167$$
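A direct translation of the TF formula into Python (a sketch of the definition above, not how scikit-learn computes it):

```python
def term_frequency(word, tokens):
    # TF = occurrences of the word / total number of tokens in the document.
    return tokens.count(word) / len(tokens)

sentence1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(round(term_frequency('cat', sentence1), 3))  # 0.167
```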
Step 2: Compute Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents:

$$IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)$$

(The examples below use the base-10 logarithm.)
Example: "cat" appears in 2 out of 2 documents, so:

$$IDF_{\text{cat}} = \log_{10}\left(\frac{2}{2}\right) = 0$$

"barked" appears in 1 out of 2 documents, so:

$$IDF_{\text{barked}} = \log_{10}\left(\frac{2}{1}\right) \approx 0.301$$
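The IDF formula in Python, again as a minimal sketch (it assumes the word occurs in at least one document, otherwise the division fails):

```python
import math

def inverse_document_frequency(word, documents):
    # IDF = log10(total documents / documents containing the word).
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / containing)

docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
        ['the', 'dog', 'barked', 'at', 'the', 'cat']]
print(round(inverse_document_frequency('cat', docs), 3))     # 0.0
print(round(inverse_document_frequency('barked', docs), 3))  # 0.301
```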
Step 3: Compute TF-IDF Score
$$TF\text{-}IDF = TF \times IDF$$

- Words that appear in many documents get lower weights.
- Words that appear in few documents get higher weights.
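Putting Steps 1 and 2 together for the running example: the word "barked" in the second sentence scores $0.167 \times 0.301 \approx 0.050$, while "cat" scores $0.167 \times 0 = 0$ because it appears in every document.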
Step 4: Implementing TF-IDF in Python
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # learn vocabulary and IDF weights, then transform

print(vectorizer.get_feature_names_out())  # display the words
print(X.toarray())                         # the TF-IDF representation
```
Output (rounded to three decimals):

```
['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0.    0.    0.303 0.    0.425 0.425 0.425 0.605]
 [0.425 0.425 0.303 0.425 0.    0.    0.    0.605]]
```

These values differ from the hand calculation above because TfidfVectorizer uses a smoothed natural-log IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each row, rather than the plain log10(N / df) formula. Within a document, higher weights indicate more distinctive words, although on a two-sentence corpus the effect is muted ("the" still scores highest here simply because it occurs twice in each sentence).
Advantages of TF-IDF
- Reduces the impact of common words.
- Highlights important words.
- Works well for text retrieval tasks.
Disadvantages of TF-IDF
- Doesn't capture word order or meaning.
- Computationally expensive for large datasets.
3. Comparison: BoW vs. TF-IDF
| Feature | Bag of Words (BoW) | TF-IDF |
|---|---|---|
| Word Frequency | Counts word occurrences | Adjusts weight using IDF |
| Handles Common Words | Treats all words equally | Lowers importance of frequent words |
| Context Awareness | No | No |
| Computational Cost | Low | Higher |
| Use Case | Text classification | Information retrieval, search engines |
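To see the "Handles Common Words" row in action, here is a quick sketch on a slightly larger corpus (four sentences invented for illustration), where the ubiquitous word "the" receives the lowest IDF weight:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented sentences; 'the' appears in every one.
corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The bird sang in the tree.",
    "The cat chased the bird.",
]

tfidf = TfidfVectorizer().fit(corpus)

# Pair each vocabulary word with its learned IDF weight and show the lowest ones.
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
print(sorted(idf.items(), key=lambda kv: kv[1])[:3])
# 'the' comes first with the smallest IDF; BoW would count it like any other word.
```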
4. Applications of BoW and TF-IDF
- Spam Filtering – Identify spam emails using keyword importance.
- Sentiment Analysis – Classify positive/negative reviews.
- Search Engines – Rank web pages based on relevance.
- Chatbots – Understand user queries using vectorized text.