Word Embeddings (Word2Vec, GloVe) in NLP
Introduction
In Natural Language Processing (NLP), traditional techniques like Bag of Words (BoW) and TF-IDF treat words as independent entities, ignoring their contextual meaning. Word Embeddings solve this issue by representing words in a continuous vector space where similar words have closer representations.
What Are Word Embeddings?
📌 Word Embeddings are numerical representations of words in a dense vector space, capturing relationships between words based on their meaning and context.
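📌 To make "closer representations" concrete, here is a minimal sketch using made-up toy vectors (not real embeddings) that compares words with cosine similarity, the standard closeness measure for dense vectors:
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 = same direction, ~0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings (real models use 50-300+ dimensions)
cat = np.array([0.9, 0.1, 0.3, 0.8])
dog = np.array([0.8, 0.2, 0.4, 0.7])
car = np.array([0.1, 0.9, 0.7, 0.1])

print(cosine_similarity(cat, dog))  # high: the toy "animal" vectors point the same way
print(cosine_similarity(cat, car))  # lower: the vectors diverge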
🔹 Advantages of Word Embeddings:
✔ Preserve semantic meaning.
✔ Capture word similarity and relationships.
✔ Reduce dimensionality compared to sparse representations like BoW.
✔ Enable better generalization in machine learning models.
🚀 Popular Word Embedding Techniques:
✔ Word2Vec (Developed by Google)
✔ GloVe (Global Vectors) (Developed by Stanford)
1. Word2Vec (Word to Vector)
📌 Developed by Google in 2013, Word2Vec transforms words into vectors based on the context in which they appear. It is an unsupervised learning algorithm that captures the semantic meaning of words using two main models:
✔ Continuous Bag of Words (CBOW)
✔ Skip-Gram Model
1.1 Continuous Bag of Words (CBOW)
📌 CBOW predicts a target word using its surrounding context words.
Example:
✅ Sentence: "The cat sat on the mat."
✅ Target word: "sat"
✅ Context words: ["The", "cat", "on", "the", "mat"]
📌 CBOW Training:
- The model takes the context words (The, cat, on, the, mat) as input.
- It predicts the target word (sat).
- It adjusts the word vectors to optimize this prediction (illustrated in the sketch below).
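📌 A minimal sketch of how such (context, target) pairs are generated from a sentence (simplified; Gensim builds these pairs internally during training):
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2  # number of words taken from each side of the target

# CBOW-style pairs: surrounding words -> target word
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'the'] -> sat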
📌 CBOW Characteristics:
✔ Faster training compared to Skip-Gram.
✔ Works better for smaller datasets.
✔ Performs well when context is available.
1.2 Skip-Gram Model
📌 Skip-Gram predicts surrounding words given a target word.
Example:
✅ Sentence: "The cat sat on the mat."
✅ Target word: "sat"
✅ Predicted context words: ["The", "cat", "on", "the", "mat"]
📌 Skip-Gram Training:
- The model takes the target word (sat) as input.
- It predicts the surrounding context words.
- It learns word representations by adjusting the word vectors (see the sketch below).
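📌 Skip-Gram reverses the direction of the pairs: each training example maps the target word to one of its context words. A minimal sketch of the pair generation (simplified; real training also uses tricks such as negative sampling):
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

# Skip-Gram-style pairs: target word -> each surrounding word
for i, target in enumerate(sentence):
    for context in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]:
        print(target, "->", context)
# e.g. sat -> the, sat -> cat, sat -> on, sat -> the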
📌 Skip-Gram Characteristics:
✔ Works better for infrequent words.
✔ Performs well on larger datasets.
✔ Computationally expensive compared to CBOW.
1.3 Word2Vec Training in Python
We can train Word2Vec using the Gensim library.
Step 1: Install Dependencies
pip install gensim nltk
Step 2: Import Libraries
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download("punkt")  # tokenizer data required by word_tokenize
Step 3: Define Corpus & Tokenize
corpus = ["The cat sat on the mat.",
"The dog barked at the cat.",
"The cat chased the mouse.",
"Dogs and cats are common pets."]
# Tokenizing sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
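📌 As a quick sanity check, the first tokenized sentence should look like this (sentences are lowercased before tokenizing, and punctuation becomes its own token):
print(tokenized_corpus[0])
# ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']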
Step 4: Train Word2Vec Model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)
📌 Hyperparameters:
✔ vector_size=100 → dimensionality of the word vectors
✔ window=5 → size of the context window on each side of the target word
✔ min_count=1 → minimum frequency a word needs to be kept (1 keeps every word; raise it to drop rare words)
✔ workers=4 → number of CPU threads used for training
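📌 Gensim trains CBOW by default; the sg parameter switches architectures (sg=0 is CBOW, sg=1 is Skip-Gram). For example:
# Same corpus and hyperparameters, but trained with the Skip-Gram architecture
skipgram_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5,
                          min_count=1, workers=4, sg=1)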
Step 5: Get Word Embeddings
print(model.wv["cat"]) # Get vector for "cat"
Step 6: Find Similar Words
print(model.wv.most_similar("cat")) # Find words similar to "cat"
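📌 You can also compare two specific words directly; wv.similarity returns their cosine similarity (on a tiny toy corpus like this the numbers will be noisy):
print(model.wv.similarity("cat", "dog"))     # cosine similarity between the two vectors
print(model.wv.most_similar("cat", topn=3))  # restrict the neighbour list to the top 3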
2. GloVe (Global Vectors for Word Representation)
📌 Developed at Stanford in 2014, GloVe learns word embeddings from global word co-occurrence statistics in a large corpus.
🔹 How GloVe Works:
✔ Creates a word co-occurrence matrix, counting how often words appear together (see the sketch after this list).
✔ Factorizes the matrix to produce word vectors.
✔ Captures global statistics rather than relying only on local context like Word2Vec.
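📌 A minimal sketch of building such a co-occurrence matrix for a toy corpus (GloVe additionally weights counts by distance and then fits vectors whose dot products approximate the log counts; this sketch only does the counting):
from collections import defaultdict

toy_corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

# cooccur[(w1, w2)] = number of times w2 appears within `window` words of w1
cooccur = defaultdict(float)
for sentence in toy_corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccur[(word, sentence[j])] += 1

print(cooccur[("cat", "sat")])  # how often "sat" occurs near "cat"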
2.1 Difference Between Word2Vec and GloVe
| Feature | Word2Vec | GloVe |
|---|---|---|
| Approach | Predicts words using context (CBOW/Skip-Gram) | Factorizes a word co-occurrence matrix |
| Captures Local/Global Info | Local context-based | Global word co-occurrence |
| Performance on Small Data | Good | Needs a large corpus |
| Computational Efficiency | Faster | Slower |
2.2 Using Pre-trained GloVe Vectors
📌 Stanford provides pre-trained GloVe embeddings (trained on large datasets like Wikipedia).
Step 1: Download Pre-trained GloVe Vectors
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
Step 2: Load GloVe Vectors
import numpy as np

# Load the pre-trained GloVe vectors into a dictionary: word -> 100-dimensional vector
glove_dict = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype="float32")
        glove_dict[word] = vector

# Check the embedding for "cat"
print(glove_dict["cat"])
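📌 With the vectors loaded, a quick way to verify that related words are close is to compute cosine similarity between entries (both words below are in the glove.6B vocabulary):
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(glove_dict["cat"], glove_dict["dog"]))  # high: semantically related
print(cosine(glove_dict["cat"], glove_dict["car"]))  # lower: less related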
3. Applications of Word Embeddings
🚀 Word Embeddings power many NLP applications:
✔ Chatbots & Virtual Assistants – Understanding queries.
✔ Machine Translation – Capturing word relationships.
✔ Text Classification – Sentiment analysis, spam filtering.
✔ Recommendation Systems – Identifying similar content.
✔ Question-Answering Systems – Context-aware responses.
4. Comparison: Word2Vec vs. GloVe
| Feature | Word2Vec | GloVe |
|---|---|---|
| Training Method | Predictive (CBOW, Skip-Gram) | Matrix factorization |
| Captures Meaning? | Yes | Yes |
| Training Speed | Faster | Slower |
| Handles Rare Words | Better | Requires a large corpus |
| Performance on Large Data | Performs well | Performs better |
5. Key Takeaways
✅ Word Embeddings help capture semantic meaning in text.
✅ Word2Vec (CBOW, Skip-Gram) predicts words based on context.
✅ GloVe creates embeddings from word co-occurrence statistics.
✅ Both techniques power NLP applications like chatbots, sentiment analysis, and search engines.
📌 Next Steps: Want to explore how to fine-tune embeddings for your dataset? Let me know!