How to reduce the dimensionality of a sparse matrix in Python?

A matrix usually consists of a mix of zero and non-zero entries. When most of its entries are zero, it is called a sparse matrix; when most of its entries are non-zero, it is called a dense matrix. Sparse matrices appear frequently in high-dimensional machine learning and deep learning problems.
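To make the definition concrete, the following minimal sketch measures how sparse a matrix is as the fraction of its entries that are zero. The matrix `A` is a made-up example, not data from this article.

```python
import numpy as np

# A small hypothetical matrix in which most entries are zero.
A = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

# Sparsity = fraction of entries that are zero.
nonzero = np.count_nonzero(A)
sparsity = 1.0 - nonzero / A.size

print(f"non-zero entries: {nonzero} of {A.size}")
print(f"sparsity: {sparsity:.2f}")
```

A value close to 1 means the matrix is very sparse; a value close to 0 means it is dense.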

Two common areas where such sparse, high-dimensional data appears are:

Natural Language Processing: when documents are encoded as vectors over a vocabulary, each document uses only a small fraction of the words, so most of its vector elements are 0.
Computer Vision: large parts of an image can be occupied by a single uniform color (e.g., a white background) that carries no useful information.
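The NLP case can be illustrated with a short sketch. It assumes a bag-of-words encoding via scikit-learn's CountVectorizer, and the three documents are invented for illustration only.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny hypothetical documents.
docs = [
    "sparse matrices save memory",
    "dense matrices store every entry",
    "text data is usually sparse",
]

# fit_transform returns a SciPy sparse matrix of word counts:
# one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(docs)

# Density = fraction of entries that are non-zero.
density = X_text.nnz / (X_text.shape[0] * X_text.shape[1])

print("is sparse:", sparse.issparse(X_text))
print("shape:", X_text.shape)
print(f"density: {density:.2f}")
```

With a realistic corpus the vocabulary grows to many thousands of words while each document still contains only a few of them, so the density drops much further.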

In such cases, working with the full high-dimensional matrix increases the time and space cost of the problem, so it is advisable to reduce the dimensionality of the sparse matrix. This article walks through how to do that in Python.
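The space cost can be seen in a rough sketch that compares the memory used by a dense array with the memory used by its CSR counterpart. The 1000 x 1000 size and the roughly 1% fill rate are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build a hypothetical 1000 x 1000 matrix with at most 10,000 non-zero entries.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 1000, size=10_000)
dense[rows, cols] = 1.0

sp = csr_matrix(dense)

# The dense array stores every entry; CSR stores only the non-zero
# values plus two index arrays.
dense_bytes = dense.nbytes
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The exact numbers depend on the index dtype SciPy chooses, but the sparse form should be far smaller whenever most entries are zero.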

One way to reduce the dimensionality is to first store the matrix in Compressed Sparse Row (CSR) format, which represents it with three one-dimensional arrays: the non-zero values, the row extents (where each row starts and ends within the values array), and the column indices of those values. scikit-learn's TruncatedSVD can then be applied to the CSR matrix to reduce its dimensionality. TruncatedSVD accepts SciPy sparse matrices directly, so the data does not have to be converted back to a dense array.
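The three CSR arrays can be inspected directly on a SciPy csr_matrix through its `data`, `indices`, and `indptr` attributes. The small matrix below is a made-up example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A hypothetical 4 x 4 matrix with four non-zero entries.
dense = np.array([
    [5, 0, 0, 0],
    [0, 0, 7, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 9],
])
csr = csr_matrix(dense)

print(csr.data)     # non-zero values, row by row: [5 7 2 9]
print(csr.indices)  # column index of each value:  [0 2 0 3]
print(csr.indptr)   # row extents:                 [0 1 2 2 4]
```

Row i owns the slice `data[indptr[i]:indptr[i + 1]]`. The third row is empty, which is why `indptr` repeats the value 2.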

Example:

First, load the built-in digits dataset from scikit-learn and standardize each feature using StandardScaler. Next, represent the standardized matrix in sparse form using csr_matrix. Then import TruncatedSVD from sklearn, specify the number of dimensions required in the final output, and apply it to the sparse matrix. Finally, check the shape of the reduced matrix.

Python3

from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# load the built-in digits dataset
digits = datasets.load_digits()
print(digits.data)

# shape of the dense matrix
print(digits.data.shape)

# standardize each feature to zero mean and unit variance
# (mean-centering turns most zero entries into non-zero values)
X = StandardScaler().fit_transform(digits.data)
print(X)

# represent the matrix in CSR form
X_sparse = csr_matrix(X)
print(X_sparse)

# specify the number of output features
tsvd = TruncatedSVD(n_components=10)

# fit the truncated SVD and project the data in one step
X_sparse_tsvd = tsvd.fit_transform(X_sparse)
print(X_sparse_tsvd)

# shape of the reduced matrix
print(X_sparse_tsvd.shape)

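Output:

Besides the printed arrays, the two shape checks report (1797, 64) for the dense matrix and (1797, 10) for the reduced matrix.

Let us cross-verify the original and the transformed number of features: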

Python3

print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_sparse_tsvd.shape[1])

Output:

Original number of features: 64
Reduced number of features: 10
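How many components to keep is a judgment call. One common check, sketched here rather than prescribed, is the share of variance that the retained components explain, which TruncatedSVD exposes as `explained_variance_ratio_`. The fixed `random_state=0` is an assumption added only to make the run repeatable.

```python
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Same preprocessing as in the example above.
X = StandardScaler().fit_transform(datasets.load_digits().data)
X_sparse = csr_matrix(X)

# random_state is fixed here only for repeatability (an assumption).
tsvd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = tsvd.fit_transform(X_sparse)

# Fraction of the total variance captured by each kept component.
ratios = tsvd.explained_variance_ratio_
print("variance explained per component:", ratios.round(3))
print("total variance explained:", ratios.sum().round(3))
```

If the total is too low for the task at hand, increasing `n_components` and re-running the check is a simple way to explore the trade-off between size and retained information.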
