Word Embeddings (Word2Vec, GloVe) in NLP
Introduction
In Natural Language Processing (NLP), traditional techniques like Bag of Words (BoW) and TF-IDF treat words as independent entities, ignoring their contextual meaning. Word Embeddings solve this issue by representing words in a continuous vector space where similar words have closer representations.
What Are Word Embeddings?
📌 Word Embeddings are numerical representations of words in a dense vector space, capturing relationships between words based on their meaning and context.
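📌 To make "closer representations" concrete, here is a minimal sketch using made-up toy vectors (not real embeddings) that compares words with cosine similarity, the standard closeness measure for dense vectors:
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 = same direction, ~0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings (real models use 50-300+ dimensions)
cat = np.array([0.9, 0.1, 0.3, 0.8])
dog = np.array([0.8, 0.2, 0.4, 0.7])
car = np.array([0.1, 0.9, 0.7, 0.1])

print(cosine_similarity(cat, dog))  # high: the toy "animal" vectors point the same way
print(cosine_similarity(cat, car))  # lower: the vectors diverge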
🔹 Advantages of Word Embeddings:
✔ Preserve semantic meaning.
✔ Capture word similarity and relationships.
✔ Reduce dimensionality compared to sparse representations like BoW.
✔ Enable better generalization in machine learning models.
🚀 Popular Word Embedding Techniques:
✔ Word2Vec (Developed by Google)
✔ GloVe (Global Vectors) (Developed by Stanford)
1. Word2Vec (Word to Vector)
📌 Developed by Google in 2013, Word2Vec transforms words into vectors based on the context in which they appear. It is an unsupervised learning algorithm that captures the semantic meaning of words using two main models:
✔ Continuous Bag of Words (CBOW)
✔ Skip-Gram Model
1.1 Continuous Bag of Words (CBOW)
📌 CBOW predicts a target word using its surrounding context words.
Example:
✅ Sentence: "The cat sat on the mat."
✅ Target word: "sat"
✅ Context words: ["The", "cat", "on", "the", "mat"]
📌 CBOW Training:
- The model takes the context words (The, cat, on, the, mat) as input.
- It predicts the target word (sat).
- It adjusts the word vectors to optimize this prediction (illustrated in the sketch below).
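📌 A minimal sketch of how such (context, target) pairs are generated from a sentence (simplified; Gensim builds these pairs internally during training):
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2  # number of words taken from each side of the target

# CBOW-style pairs: surrounding words -> target word
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'the'] -> sat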
📌 CBOW Characteristics:
✔ Faster training compared to Skip-Gram.
✔ Works better for smaller datasets.
✔ Performs well when context is available.
1.2 Skip-Gram Model
📌 Skip-Gram predicts surrounding words given a target word.
Example:
✅ Sentence: "The cat sat on the mat."
✅ Target word: "sat"
✅ Predicted context words: ["The", "cat", "on", "the", "mat"]
📌 Skip-Gram Training:
- The model takes the target word (sat) as input.
- It predicts the surrounding context words.
- It learns word representations by adjusting the word vectors (see the sketch below).
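📌 Skip-Gram reverses the direction of the pairs: each training example maps the target word to one of its context words. A minimal sketch of the pair generation (simplified; real training also uses tricks such as negative sampling):
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

# Skip-Gram-style pairs: target word -> each surrounding word
for i, target in enumerate(sentence):
    for context in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]:
        print(target, "->", context)
# e.g. sat -> the, sat -> cat, sat -> on, sat -> the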
📌 Skip-Gram Characteristics:
✔ Works better for infrequent words.
✔ Performs well on larger datasets.
✔ Computationally expensive compared to CBOW.
1.3 Word2Vec Training in Python
We can train Word2Vec using the Gensim library.
Step 1: Install Dependencies
pip install gensim nltk
Step 2: Import Libraries
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download("punkt")  # tokenizer data required by word_tokenize
Step 3: Define Corpus & Tokenize
corpus = ["The cat sat on the mat.",
"The dog barked at the cat.",
"The cat chased the mouse.",
"Dogs and cats are common pets."]
# Tokenizing sentences
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
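📌 As a quick sanity check, the first tokenized sentence should look like this (sentences are lowercased before tokenizing, and punctuation becomes its own token):
print(tokenized_corpus[0])
# ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']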
Step 4: Train Word2Vec Model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)
📌 Hyperparameters:
✔ vector_size=100 → dimensionality of the word vectors
✔ window=5 → size of the context window on each side of the target word
✔ min_count=1 → minimum frequency a word needs to be kept (1 keeps every word; raise it to drop rare words)
✔ workers=4 → number of CPU threads used for training
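📌 Gensim trains CBOW by default; the sg parameter switches architectures (sg=0 is CBOW, sg=1 is Skip-Gram). For example:
# Same corpus and hyperparameters, but trained with the Skip-Gram architecture
skipgram_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5,
                          min_count=1, workers=4, sg=1)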
Step 5: Get Word Embeddings
print(model.wv["cat"]) # Get vector for "cat"
Step 6: Find Similar Words
print(model.wv.most_similar("cat")) # Find words similar to "cat"
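📌 You can also compare two specific words directly; wv.similarity returns their cosine similarity (on a tiny toy corpus like this the numbers will be noisy):
print(model.wv.similarity("cat", "dog"))     # cosine similarity between the two vectors
print(model.wv.most_similar("cat", topn=3))  # restrict the neighbour list to the top 3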
2. GloVe (Global Vectors for Word Representation)
📌 Developed at Stanford in 2014, GloVe learns word embeddings from global word co-occurrence statistics in a large corpus.
🔹 How GloVe Works:
✔ Creates a word co-occurrence matrix, counting how often words appear together (see the sketch after this list).
✔ Factorizes the matrix to produce word vectors.
✔ Captures global statistics rather than relying only on local context like Word2Vec.
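📌 A minimal sketch of building such a co-occurrence matrix for a toy corpus (GloVe additionally weights counts by distance and then fits vectors whose dot products approximate the log counts; this sketch only does the counting):
from collections import defaultdict

toy_corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

# cooccur[(w1, w2)] = number of times w2 appears within `window` words of w1
cooccur = defaultdict(float)
for sentence in toy_corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccur[(word, sentence[j])] += 1

print(cooccur[("cat", "sat")])  # how often "sat" occurs near "cat"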
2.1 Difference Between Word2Vec and GloVe
| Feature | Word2Vec | GloVe |
|---|---|---|
| Approach | Predicts words using context (CBOW/Skip-Gram) | Factorizes a word co-occurrence matrix |
| Captures Local/Global Info | Local context-based | Global word co-occurrence |
| Performance on Small Data | Good | Needs a large corpus |
| Computational Efficiency | Faster | Slower |
2.2 Using Pre-trained GloVe Vectors
📌 Stanford provides pre-trained GloVe embeddings (trained on large datasets like Wikipedia).
Step 1: Download Pre-trained GloVe Vectors
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
Step 2: Load GloVe Vectors
import numpy as np

# Load the pre-trained GloVe vectors into a dictionary: word -> 100-dimensional vector
glove_dict = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype="float32")
        glove_dict[word] = vector

# Check the embedding for "cat"
print(glove_dict["cat"])
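📌 With the vectors loaded, a quick way to verify that related words are close is to compute cosine similarity between entries (both words below are in the glove.6B vocabulary):
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(glove_dict["cat"], glove_dict["dog"]))  # high: semantically related
print(cosine(glove_dict["cat"], glove_dict["car"]))  # lower: less related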
3. Applications of Word Embeddings
🚀 Word Embeddings power many NLP applications:
✔ Chatbots & Virtual Assistants – Understanding queries.
✔ Machine Translation – Capturing word relationships.
✔ Text Classification – Sentiment analysis, spam filtering.
✔ Recommendation Systems – Identifying similar content.
✔ Question-Answering Systems – Context-aware responses.
4. Comparison: Word2Vec vs. GloVe
| Feature | Word2Vec | GloVe |
|---|---|---|
| Training Method | Predictive (CBOW, Skip-Gram) | Matrix factorization |
| Captures Meaning? | Yes | Yes |
| Training Speed | Faster | Slower |
| Handles Rare Words | Better | Requires a large corpus |
| Performance on Large Data | Performs well | Performs better |
5. Key Takeaways
✅ Word Embeddings help capture semantic meaning in text.
✅ Word2Vec (CBOW, Skip-Gram) predicts words based on context.
✅ GloVe creates embeddings from word co-occurrence statistics.
✅ Both techniques power NLP applications like chatbots, sentiment analysis, and search engines.
📌 Next Steps: Want to explore how to fine-tune embeddings for your dataset? Let me know!