Bag of Words (BoW) and TF-IDF in Natural Language Processing (NLP)

Introduction

In Natural Language Processing (NLP), text data must be converted into a numerical format before it can be used by machine learning models. Two fundamental text vectorization techniques serve this purpose:
βœ” Bag of Words (BoW) – Represents text based on word frequency.
βœ” Term Frequency-Inverse Document Frequency (TF-IDF) – Represents text based on word importance.

These methods help in text classification, sentiment analysis, document retrieval, and other NLP tasks.


1. Bag of Words (BoW)

What is Bag of Words?

The Bag of Words (BoW) model represents text data by counting the occurrences of each word in a document while ignoring grammar and word order.

πŸ”Ή Key Features:
βœ” Converts text into numerical form.
βœ” Considers word frequency.
βœ” Ignores word meaning and order.


Steps in Bag of Words

Step 1: Text Preprocessing

πŸ“Œ Convert text to lowercase.
πŸ“Œ Remove punctuation and special characters (stopword removal is optional and skipped here, so the example keeps words like "the").
πŸ“Œ Tokenize the text into words.

Example Text:
πŸ“œ "The cat sat on the mat. The dog barked at the cat."
πŸ”Ή Unique Tokens: ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'barked', 'at']
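
A minimal preprocessing sketch using only Python's standard library (real pipelines often use NLTK or spaCy; the regex tokenizer here is a simplification):

import re

text = "The cat sat on the mat. The dog barked at the cat."

# Lowercase the text and extract alphabetic tokens (punctuation is dropped)
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog', 'barked', 'at', 'the', 'cat']

πŸ“Œ Duplicates are kept because BoW needs the counts; the unique words form the vocabulary in the next step.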


Step 2: Creating a Vocabulary

A vocabulary is created from all unique words in the dataset.

Word      Index
the       0
cat       1
sat       2
on        3
mat       4
dog       5
barked    6
at        7
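
A small sketch that builds this word-to-index mapping, continuing from the tokens produced in Step 1 (insertion order matches the table above):

vocab = {}
for word in tokens:
    if word not in vocab:
        vocab[word] = len(vocab)  # assign the next free index to each new word
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'barked': 6, 'at': 7}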

Step 3: Constructing the Word Count Vector

Each sentence is represented as a vector where each element corresponds to a word’s frequency.

Sentence                      the  cat  sat  on  mat  dog  barked  at
"The cat sat on the mat"       2    1    1   1    1    0     0     0
"The dog barked at the cat"    2    1    0   0    0    1     1     1

πŸ“Œ Each row is a vector representation of a sentence.
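
A sketch of this counting step, reusing re and the vocab mapping from Step 2 (the helper name count_vector is ours, not a library function):

def count_vector(sentence, vocab):
    # Represent a sentence as word counts over a fixed vocabulary
    vec = [0] * len(vocab)
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

print(count_vector("The cat sat on the mat", vocab))     # [2, 1, 1, 1, 1, 0, 0, 0]
print(count_vector("The dog barked at the cat", vocab))  # [2, 1, 0, 0, 0, 1, 1, 1]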


Step 4: Implementing Bag of Words in Python

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # Vocabulary (unique words, sorted alphabetically)
print(X.toarray())                         # Word count matrix: one row per sentence

πŸ”Ή Output:

['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0 0 1 0 1 1 1 2]
 [1 1 1 1 0 0 0 2]]

πŸ“Œ Each row represents a sentence, and each column represents a word's count. Note that CountVectorizer sorts the vocabulary alphabetically, so the column order differs from the hand-built vocabulary above.


Advantages of Bag of Words

βœ” Simple and easy to implement.
βœ” Works well for text classification tasks.
βœ” Effective for small datasets.

Disadvantages of Bag of Words

❌ Ignores word meaning and context.
❌ Vocabulary size increases with dataset size.
❌ Treats all words equally, ignoring importance.


2. Term Frequency – Inverse Document Frequency (TF-IDF)

What is TF-IDF?

TF-IDF (Term Frequency – Inverse Document Frequency) is an improved version of BoW that assigns weights to words based on their importance in a document relative to a collection of documents (corpus).

πŸ”Ή Key Features:
βœ” Highlights important words in a document.
βœ” Reduces the impact of frequently occurring but unimportant words like “the”, “is”, “and”.
βœ” Normalizes text data for better performance in ML models.


Step 1: Compute Term Frequency (TF)

πŸ“Œ Term Frequency (TF) measures how frequently a word appears in a document:

TF = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}

βœ” Example:
πŸ“œ "The cat sat on the mat. The dog barked at the cat."
βœ” TF calculation for the word "cat" in the first sentence (6 words, "cat" appears once): TF_{\text{cat}} = \frac{1}{6} \approx 0.167
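
A direct translation of the formula into Python (term_frequency is our own helper name for illustration):

import re

def term_frequency(word, document):
    # Count of the word divided by the total number of words in the document
    words = re.findall(r"[a-z]+", document.lower())
    return words.count(word) / len(words)

print(round(term_frequency("cat", "The cat sat on the mat"), 3))  # 0.167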


Step 2: Compute Inverse Document Frequency (IDF)

πŸ“Œ Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents:

IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)

βœ” Example: If "cat" appears in 2 out of 2 documents, then: IDF_{\text{cat}} = \log\left(\frac{2}{2}\right) = 0

βœ” If "barked" appears in 1 out of 2 documents: IDF_{\text{barked}} = \log\left(\frac{2}{1}\right) \approx 0.301 (using the base-10 logarithm)
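
The same idea in code, using math.log10 so the numbers match those above (inverse_document_frequency is our own helper name):

import math
import re

def inverse_document_frequency(word, documents):
    # Count how many documents contain the word at least once
    containing = sum(1 for doc in documents if word in re.findall(r"[a-z]+", doc.lower()))
    return math.log10(len(documents) / containing)

docs = ["The cat sat on the mat.", "The dog barked at the cat."]
print(inverse_document_frequency("cat", docs))               # 0.0
print(round(inverse_document_frequency("barked", docs), 3))  # 0.301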


Step 3: Compute TF-IDF Score

\text{TF-IDF} = \text{TF} \times \text{IDF}

βœ” Words that appear in many documents get lower weights.
βœ” Words that appear in few documents get higher weights.
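
A sketch combining the two helpers defined above into a raw TF-IDF score (production libraries add smoothing and normalization, as the scikit-learn example below shows):

def tf_idf(word, document, documents):
    # TF-IDF = TF × IDF, combining term_frequency and inverse_document_frequency
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

print(round(tf_idf("barked", docs[1], docs), 3))  # 0.05 (= 1/6 × 0.301)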


Step 4: Implementing TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # Vocabulary (unique words, sorted alphabetically)
print(X.toarray())                         # TF-IDF weight matrix: one row per sentence

πŸ”Ή Output:

['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']
[[0.    0.    0.303 0.    0.425 0.425 0.425 0.605]
 [0.425 0.425 0.303 0.425 0.    0.    0.    0.605]]

πŸ“Œ Values above are rounded to three decimals. They differ from the hand calculation in Steps 1-3 because scikit-learn's TfidfVectorizer uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row by default, so even "the" keeps a nonzero weight. With equal counts, words that occur in fewer documents still score higher (compare "mat" at 0.425 with "cat" at 0.303); to drop common words entirely, pass stop_words='english'.


Advantages of TF-IDF

βœ” Reduces the impact of common words.
βœ” Highlights important words.
βœ” Works well for text retrieval tasks.

Disadvantages of TF-IDF

❌ Doesn’t capture word order or meaning.
❌ Computationally expensive for large datasets.


3. Comparison: BoW vs. TF-IDF

Feature               Bag of Words (BoW)         TF-IDF
Word frequency        Counts word occurrences    Adjusts weight using IDF
Common words          Treats all words equally   Lowers weight of frequent words
Context awareness     ❌ No                      ❌ No
Computational cost    Low                        Higher
Typical use case      Text classification        Information retrieval, search engines

4. Applications of BoW and TF-IDF

βœ” Spam Filtering – Identify spam emails using keyword importance.
βœ” Sentiment Analysis – Classify positive/negative reviews.
βœ” Search Engines – Rank web pages based on relevance.
βœ” Chatbots – Understand user queries using vectorized text.

