Dimensionality Reduction

Many machine learning problems involve a large number of features. Fortunately, in real-world problems it is often possible to remove some of them. For instance, in the MNIST dataset of handwritten digit images, the pixels along the image borders are almost always white across the images, so they can be dropped. Likewise, two neighbouring pixels are often highly correlated, and merging them (for example, into their mean value) would not cause much loss of information.

The Curse of Dimensionality

We are so used to living in three dimensions that our intuition fails us when we try to picture higher-dimensional spaces, even 4D.

The main approaches to Dimensionality Reduction

A. Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. If you take 3D data of this kind and project it onto a 2D plane, you have reduced its dimensionality. However, projection is not always the right approach. Consider the famous Swiss roll dataset, sketched below: if you simply project it onto a plane, the layers of the roll get squashed together, and much of the information that distinguishes the individual instances is lost.
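A minimal sketch of this, assuming Scikit-Learn's make_swiss_roll helper and matplotlib (neither is mentioned above): generate a Swiss roll and naively project it onto a plane by dropping one coordinate, so different layers of the roll land on top of each other.

    # Sketch: naively projecting a Swiss roll onto 2D by dropping an axis.
    # Assumes scikit-learn and matplotlib are available.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_swiss_roll

    X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

    # Keep only the first two coordinates (drop the axis the roll spirals in):
    # layers of the roll now overlap, and the colour gradient gets scrambled.
    plt.scatter(X[:, 0], X[:, 1], c=t, cmap="viridis", s=5)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.title("Swiss roll squashed onto a plane")
    plt.show()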

B. Manifold Learning

The Swiss roll we saw above is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent, twisted and stretched in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold assumption, which states that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is often accompanied by another, implicit one: that the task at hand (classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm.

A. Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you need to pick the right hyperplane. Many different hyperplanes (or axes) are possible, but you should choose the one onto which the projection preserves the most variance, since that is the choice that loses the least information.
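As a small illustrative sketch with made-up 2D data (none of these names come from the text above), you can make this concrete by projecting the data onto two different candidate axes and comparing how much variance each projection preserves.

    # Sketch: compare the variance preserved by projecting made-up 2D data
    # onto two different unit-vector axes.
    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data
    X = X - X.mean(axis=0)                                              # center the data

    axis_a = np.array([1.0, 0.0])                  # project onto the x-axis
    axis_b = np.array([1.0, 1.0]) / np.sqrt(2.0)   # project onto the diagonal

    for axis in (axis_a, axis_b):
        proj = X @ axis                            # 1D projection onto the axis
        print(axis, "variance preserved:", proj.var())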

B. Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, then a third axis orthogonal to both, and so on, one axis per dimension of the dataset.


The unit vector that defines the i-th axis is called the i-th principal component.


To find the principal components of a training set, there is a standard matrix factorization technique called Singular Value Decomposition (SVD). It decomposes the training set matrix X into the product of three matrices, one of which contains the principal components.

In this decomposition, X = U Σ Vᵀ, the matrix V is the one that contains the unit vectors defining the principal components.
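A minimal NumPy sketch of this, with made-up data (PCA assumes the dataset is centered around the origin, so the centering step matters):

    # Sketch: extract principal components with NumPy's SVD on made-up data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    X_centered = X - X.mean(axis=0)      # PCA assumes centered data

    U, s, Vt = np.linalg.svd(X_centered)
    c1 = Vt.T[:, 0]                      # first principal component (unit vector)
    c2 = Vt.T[:, 1]                      # second principal component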

Projecting down to d dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. To project the training set onto that hyperplane, all you have to do is multiply the training set matrix by the matrix containing the first d principal components (the first d columns of V).
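Continuing the same idea as the SVD sketch above (self-contained here, again with made-up data), the projection down to d = 2 dimensions is a single matrix multiplication:

    # Sketch: project the centered training set onto the first d = 2
    # principal components, i.e. X_2D = X_centered @ W2, where W2 holds
    # the first two columns of V.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    X_centered = X - X.mean(axis=0)

    U, s, Vt = np.linalg.svd(X_centered)
    W2 = Vt.T[:, :2]                 # first two principal components
    X2D = X_centered @ W2            # projected dataset, shape (100, 2)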

Explained Variance Ratio

The explained variance ratio is available through the explained_variance_ratio_ attribute of a fitted PCA object in Scikit-Learn. It indicates the proportion of the dataset's variance that lies along the axis of each principal component.
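A minimal Scikit-Learn sketch of inspecting this attribute, using a made-up 3D dataset (the printed numbers depend entirely on the data):

    # Sketch: fit Scikit-Learn's PCA and inspect the explained variance ratio.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # made-up correlated data

    pca = PCA(n_components=2)
    X2D = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)   # proportion of variance along each PC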

Choosing the right number of dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is better to choose the number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance (for example, 95%). The exception is when you are reducing dimensionality for data visualization, in which case you generally want to go down to 2 or 3 dimensions.
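One common sketch of this, with made-up data: compute the cumulative explained variance and pick d from it, or pass a float between 0 and 1 as n_components and let Scikit-Learn's PCA choose d for you.

    # Sketch: pick the number of dimensions that preserves ~95% of the variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # made-up data

    pca = PCA()
    pca.fit(X_train)
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    d = np.argmax(cumsum >= 0.95) + 1    # smallest d that reaches 95% of the variance

    # Equivalently, let PCA pick d by passing the target variance ratio:
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_train)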

PCA for compression

After dimensionality reduction, the dataset takes up much less space. It is also possible to decompress the reduced dataset back to the original number of dimensions by applying the inverse of the PCA projection. This won't give you back the original data, since the projection lost a little information, but the result will be quite close to it. The mean squared error between the original data and the reconstructed data is called the reconstruction error.
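A sketch of compression and decompression, assuming Scikit-Learn's PCA.inverse_transform and made-up data, with the reconstruction error computed by hand:

    # Sketch: compress with PCA, decompress with inverse_transform,
    # and measure the reconstruction error (mean squared error).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # made-up data

    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_train)          # compressed dataset
    X_recovered = pca.inverse_transform(X_reduced)  # back to 20 dimensions

    reconstruction_error = np.mean((X_train - X_recovered) ** 2)
    print(reconstruction_error)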

Randomized PCA

You can set the svd_solver hyperparameter to 'randomized' so that PCA uses a stochastic algorithm to quickly find an approximation of the first d principal components. By default, the svd_solver hyperparameter is set to 'auto', which means Scikit-Learn uses the randomized algorithm only when m or n is greater than 500 (and the requested number of components is small relative to the data); otherwise it uses the full SVD approach.
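A minimal sketch of requesting the randomized solver explicitly (the data and the choice of 100 components are made up for illustration):

    # Sketch: force the randomized SVD solver instead of relying on 'auto'.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 784))   # made-up data with many features

    rnd_pca = PCA(n_components=100, svd_solver="randomized", random_state=42)
    X_reduced = rnd_pca.fit_transform(X_train)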

Incremental PCA

A problem with the preceding implementations of PCA is that they require the whole training set to fit in memory in order for the algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: they let you split the training set into mini-batches and feed them to the algorithm one batch at a time.
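A sketch using Scikit-Learn's IncrementalPCA, feeding made-up data in mini-batches via partial_fit (NumPy's array_split is just one convenient way to produce the batches):

    # Sketch: Incremental PCA -- feed the training set in mini-batches
    # so the whole set never has to sit in memory at once.
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 50))    # made-up data

    n_batches = 10
    inc_pca = IncrementalPCA(n_components=10)
    for X_batch in np.array_split(X_train, n_batches):
        inc_pca.partial_fit(X_batch)         # update the model batch by batch

    X_reduced = inc_pca.transform(X_train)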

Kernel PCA

The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (the feature space), enabling non-linear classification and regression with SVMs. The same trick can be applied to PCA, making it possible to perform complex non-linear projections for dimensionality reduction; this is called Kernel PCA (kPCA).
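A minimal sketch with Scikit-Learn's KernelPCA on a Swiss roll, using an RBF kernel (the gamma value here is just an example, not a recommendation):

    # Sketch: Kernel PCA with an RBF kernel to handle non-linear structure.
    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA

    X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
    X_reduced = rbf_pca.fit_transform(X)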

Selecting a kernel and tuning the hyperparameters

Kernel PCA (kPCA) is an unsupervised learning algorithm, so there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task, so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
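A sketch of that idea: a two-step pipeline (kPCA followed by a classifier) whose kernel and gamma are tuned with GridSearchCV. The classifier, the made-up binary labels and the parameter grid are illustrative choices only.

    # Sketch: tune kPCA's kernel and gamma by grid-searching for the best
    # downstream classification accuracy.
    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
    y = (t > 6.9).astype(int)                # made-up binary target for illustration

    clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression(max_iter=1000)),
    ])

    param_grid = {
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"],
    }

    grid_search = GridSearchCV(clf, param_grid, cv=3)
    grid_search.fit(X, y)
    print(grid_search.best_params_)          # kernel and gamma with the best CV score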

Locally Linear Embedding

Locally Linear Embedding (LLE) is a powerful non-linear dimensionality reduction technique. It is a manifold learning technique that does not rely on projections. LLE works by first measuring how each training instance linearly relates to its closest neighbors, and then looking for a lower-dimensional representation of the training set in which these local relationships are best preserved.
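A minimal sketch of unrolling the Swiss roll with Scikit-Learn's LocallyLinearEmbedding (the number of neighbors is just an example value):

    # Sketch: LLE unrolls the Swiss roll into 2D while trying to preserve
    # the local linear relationships between each instance and its neighbors.
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
    X_unrolled = lle.fit_transform(X)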