t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful technique for dimensionality reduction and visualization of high-dimensional data. Unlike PCA, which is a linear method, t-SNE is a non-linear technique that is particularly useful for visualizing datasets in 2D or 3D space while preserving the local structure of the data. This means t-SNE can capture complex relationships and clusters that a linear projection would miss. Let’s break down the t-SNE algorithm step by step.
Overview of t-SNE
t-SNE is primarily used to reduce the dimensions of high-dimensional datasets into a 2D or 3D space for the purpose of visualization. It is often used in machine learning to understand the structure of the data or to visually inspect clusters of data points. The goal of t-SNE is to keep similar points close together and dissimilar points farther apart in the lower-dimensional space. It uses probability distributions to model similarity between points and minimizes the divergence between these distributions.
Step 1: Compute Pairwise Similarities in High-Dimensional Space
The first step in t-SNE involves calculating the pairwise similarities of the data points in the high-dimensional space. For each pair of points $x_i$ and $x_j$, t-SNE computes the similarity between them based on a Gaussian distribution.
- Conditional Probability: The similarity between two points $x_i$ and $x_j$ is computed as a conditional probability based on a Gaussian distribution centered at $x_i$: $$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$ Where:
- $p_{j|i}$ is the probability that point $x_j$ would be picked as a neighbor of $x_i$
- $\sigma_i^2$ is a per-point variance controlling the width of the Gaussian; it is tuned so that the effective number of neighbors matches the user-specified perplexity
- The sum is taken over all points $x_k$ with $k \neq i$.
- Symmetrization: Because each point has its own $\sigma_i$, the conditional probabilities are generally not symmetric ($p_{j|i} \neq p_{i|j}$). t-SNE therefore symmetrizes them into joint probabilities $P_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, where $n$ is the number of data points, so that $P_{ij} = P_{ji}$.
The result of this step is a probability distribution over pairs of data points, representing the likelihood that one point is a neighbor of another; the sketch below illustrates the computation.
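A minimal NumPy sketch of this step is shown below. It binary-searches each $\sigma_i$ (via $\beta_i = 1/2\sigma_i^2$) so that the entropy of row $i$ matches $\log(\text{perplexity})$; the function names, tolerance, and iteration limits are illustrative choices, not a reference implementation.

```python
import numpy as np

def conditional_probs(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} for every point, binary-searching each sigma_i so
    that the entropy of row i matches log(perplexity)."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.zeros((n, n))
    target = np.log(perplexity)  # target entropy (natural log)
    for i in range(n):
        lo, hi = 0.0, np.inf     # bounds on beta = 1 / (2 * sigma_i^2)
        beta = 1.0
        for _ in range(max_iter):
            p = np.exp(-sq_dists[i] * beta)
            p[i] = 0.0                       # a point is not its own neighbor
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
            if abs(entropy - target) < tol:
                break
            if entropy > target:             # too flat: shrink sigma (raise beta)
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (lo + hi) / 2
            else:                            # too peaked: grow sigma (lower beta)
                hi = beta
                beta = (lo + hi) / 2
        P[i] = p
    return P

def joint_probs(X, perplexity=30.0):
    """Symmetrize the conditionals: P_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    P = conditional_probs(X, perplexity)
    return (P + P.T) / (2 * X.shape[0])
```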
Step 2: Define the Similarity in the Lower-Dimensional Space
Once the high-dimensional similarities are calculated, the next step is to define the similarity between data points in the low-dimensional space (usually 2D or 3D) where we aim to project the data.
- Student’s t-Distribution: In the lower-dimensional space, t-SNE uses a Student’s t-distribution with one degree of freedom (also known as the Cauchy distribution) to model the pairwise similarity between points. The t-distribution has heavier tails than a Gaussian, which better accommodates moderate distances in low-dimensional space and mitigates the crowding problem. $$Q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$ Where:
- $Q_{ij}$ is the probability that points $y_i$ and $y_j$ are neighbors in the low-dimensional space
- $\|y_i - y_j\|^2$ is the squared Euclidean distance between points $y_i$ and $y_j$ in the low-dimensional space.
This yields a probability distribution analogous to the high-dimensional one, but defined over the embedded points; a sketch follows.
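Here is a matching sketch for the low-dimensional affinities, reusing the conventions from the previous snippet. The unnormalized affinity matrix `num` is returned alongside `Q` because the gradient in Step 4 reuses it:

```python
import numpy as np

def low_dim_affinities(Y):
    """Compute Q from the embedding Y using a Student's t-distribution
    with one degree of freedom (a Cauchy kernel)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dists)    # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(num, 0.0)      # exclude self-pairs from the sum
    Q = num / num.sum()             # normalize over all pairs k != l
    return Q, num                   # num is reused by the gradient later
```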
Step 3: Minimize the Kullback-Leibler (KL) Divergence
At this point, we have two probability distributions: one from the high-dimensional space ($P_{ij}$) and one from the low-dimensional space ($Q_{ij}$). The next step is to minimize the difference between them.
- Kullback-Leibler (KL) Divergence: t-SNE minimizes the KL divergence between the two distributions. The KL divergence is a measure of how one probability distribution diverges from a second, reference distribution. $$D_{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$ Where:
- $D_{KL}(P \,\|\, Q)$ is the KL divergence between the high-dimensional distribution $P$ and the low-dimensional distribution $Q$.
- The summation is performed over all pairs of points.
The KL divergence is minimized using gradient descent: the positions of the points in the low-dimensional space are adjusted so that the pairwise similarities $Q_{ij}$ match the high-dimensional similarities $P_{ij}$ as closely as possible. A sketch of the loss computation appears below.
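The loss itself is nearly a one-liner; the small `eps` guard here is an implementation convenience, not part of the formula:

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q). eps keeps the logarithm finite where Q_ij is tiny,
    and zero entries of P contribute nothing because they multiply the log."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```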
Step 4: Gradient Descent Optimization
To minimize the KL divergence, t-SNE uses gradient descent, an optimization technique to iteratively update the positions of the points in the low-dimensional space.
- Gradient of the KL Divergence: The gradient of the KL divergence with respect to each low-dimensional coordinate is computed; it has the closed form $$\frac{\partial D_{KL}}{\partial y_i} = 4 \sum_{j} (P_{ij} - Q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$$ This gradient gives the direction and magnitude of the change needed to reduce the KL divergence.
- Update Positions: Using the gradient, we move each point in the lower-dimensional space in the direction that reduces the KL divergence: $$y_i^{(t+1)} = y_i^{(t)} - \eta \frac{\partial D_{KL}}{\partial y_i}$$ Where:
- $y_i^{(t)}$ is the position of point $i$ in the low-dimensional space at iteration $t$
- $\eta$ is the learning rate (a small positive value)
- $\frac{\partial D_{KL}}{\partial y_i}$ is the gradient of the KL divergence with respect to the position of point $i$
- Iterative Process: Gradient descent is repeated for a number of iterations until convergence, i.e., until the positions of the points stop changing significantly or the KL divergence reaches a minimum (see the sketch after this list).
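Putting the pieces together, here is a bare-bones optimization loop built on the earlier sketches. It deliberately omits the momentum and early-exaggeration tricks that practical implementations use, so treat the learning rate and iteration count as illustrative:

```python
import numpy as np

def tsne_embed(P, n_points, n_dims=2, lr=100.0, n_iter=500, seed=0):
    """Plain gradient descent on KL(P || Q). A didactic sketch."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n_points, n_dims))  # small random init
    for _ in range(n_iter):
        Q, num = low_dim_affinities(Y)   # from the Step 2 sketch
        W = (P - Q) * num                # (P_ij - Q_ij)(1 + ||y_i - y_j||^2)^(-1)
        # Gradient: dC/dy_i = 4 * sum_j W_ij * (y_i - y_j), in matrix form.
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        Y -= lr * grad
    return Y
```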
Step 5: Visualizing the Results
Once the optimization process has converged, we are left with the low-dimensional coordinates of the data points, typically in 2D or 3D. These coordinates represent the structure of the data, with similar points being placed close to each other and dissimilar points being far apart.
At this point, we can visualize the high-dimensional data in a 2D or 3D plot. t-SNE is particularly useful in identifying clusters, outliers, or patterns in the data that may not be easily seen in the high-dimensional space.
Advantages of t-SNE
- Non-linear Relationships: t-SNE captures complex non-linear relationships in the data, unlike linear methods such as PCA.
- Effective for Visualization: It is a powerful tool for visualizing high-dimensional data in 2D or 3D, helping to identify clusters or patterns.
- Preserves Local Structure: t-SNE does a great job of preserving local structures (e.g., clusters of similar points) in the low-dimensional space.
Limitations of t-SNE
- Computationally Expensive: t-SNE is computationally expensive, especially for large datasets, because it requires calculating pairwise distances and performing optimization.
- Non-Deterministic: The results of t-SNE can vary depending on the initialization of the points in the low-dimensional space and the parameters chosen (e.g., perplexity and learning rate).
- Global Structure Loss: t-SNE focuses on preserving local relationships, and therefore, it may distort the global structure of the data (e.g., the distances between clusters may not be accurate).
Key Parameters in t-SNE
- Perplexity: This parameter controls the balance between local and global aspects of the data. It determines how many nearest neighbors should be considered when computing the similarity. Typical values range from 5 to 50.
- Learning Rate: This controls how fast the gradient descent optimization should proceed. A learning rate that is too high or too low can cause the algorithm to converge to a poor solution.
- Number of Iterations: The number of gradient descent steps to perform. More iterations give the optimization more time to converge, usually yielding a better representation.
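These knobs map directly onto scikit-learn’s `TSNE` estimator as `perplexity` and `learning_rate` (the keyword for the iteration cap differs between scikit-learn versions, so it is left at its default here). A small sketch that sweeps perplexity, assuming `X` is an existing `(n_samples, n_features)` array:

```python
from sklearn.manifold import TSNE

# Sweep perplexity to see how the local/global balance affects the result.
# X is assumed to be an (n_samples, n_features) NumPy array.
for perplexity in (5, 30, 50):
    tsne = TSNE(
        n_components=2,          # target dimensionality
        perplexity=perplexity,   # effective number of neighbors
        learning_rate=200.0,     # gradient descent step size
        random_state=0,          # t-SNE is non-deterministic otherwise
    )
    embedding = tsne.fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {embedding.shape}")
```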
Example of t-SNE in Practice
Let’s say you have a dataset with 1000 high-dimensional features (e.g., 1000 pixel values from an image dataset). You want to visualize the data in 2D:
- Standardize the data (if necessary).
- Apply t-SNE to reduce the dimensionality to 2D using the steps outlined above.
- Visualize the resulting 2D data with a scatter plot: each point corresponds to an original high-dimensional data point, and points that sit close together in the plot are similar in the original space (see the sketch below).
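Here is a runnable version of this workflow, using scikit-learn’s built-in digits dataset (64 pixel features rather than 1000, but the steps are identical):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load a small image dataset: 1797 digit images, 64 pixel features each.
digits = load_digits()
X, y = digits.data, digits.target

# 1. Standardize the data.
X_scaled = StandardScaler().fit_transform(X)

# 2. Apply t-SNE to reduce the dimensionality to 2D.
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

# 3. Visualize with a scatter plot, colored by digit label.
plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.colorbar(label="digit")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```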
In summary, t-SNE is an effective tool for visualizing and understanding high-dimensional data by preserving the local structure in a lower-dimensional space. Although it is computationally intensive, it is widely used in fields like machine learning, bioinformatics, and image processing for exploratory data analysis and visualization.
