Multidimensional data analysis in Python
Multidimensional (multivariate) data analysis examines many variables and the relationships among them at once. Let's shed light on some basic techniques for analysing multidimensional data using open-source libraries written in Python.
The data used for illustration can be found here.
The following code reads the 2D tabular data from zoo_data.csv.
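A minimal sketch of the read step, assuming pandas. Since zoo_data.csv is not bundled here, a small inline stand-in is used; the column names follow the UCI Zoo dataset and are an assumption.

```python
import pandas as pd
from io import StringIO

# Inline stand-in for zoo_data.csv (the real file has more rows and
# columns; these column names are assumed from the UCI Zoo dataset).
csv_text = """animal_name,hair,feathers,eggs,milk,aquatic,toothed,breathes
aardvark,1,0,0,1,0,1,1
chicken,0,1,1,0,0,0,1
dolphin,0,0,0,1,1,1,1
frog,0,0,1,0,1,1,1
"""

# In practice you would simply do: zoo = pd.read_csv("zoo_data.csv")
zoo = pd.read_csv(StringIO(csv_text))
print(zoo.shape)
print(zoo.head())
```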
Note: The data we have here is largely categorical. The techniques used in this case study for categorical data analysis are basic ones that are simple to understand, interpret and implement. They include cluster analysis, correlation analysis, PCA (Principal Component Analysis) and EDA (Exploratory Data Analysis).
As the data describes the characteristics of different types of animals, we can classify the animals into groups (clusters) or subgroups using well-known clustering techniques such as KMeans clustering, DBSCAN and hierarchical clustering (KNN, K-Nearest Neighbours, is a related technique, but it is a classification method rather than a clustering one). For the sake of simplicity, KMeans is a good option in this case. Clustering with KMeans can be done using the KMeans class from sklearn.cluster as follows:
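A hedged sketch of the KMeans step. The feature columns below are an illustrative 0/1 subset standing in for the full zoo feature matrix; the original study clusters into 7 groups (one per animal class), while this toy example uses 2.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 0/1 animal features standing in for the zoo data.
X = pd.DataFrame({
    "hair":     [1, 0, 0, 0, 1, 0],
    "feathers": [0, 1, 0, 0, 0, 1],
    "eggs":     [0, 1, 1, 1, 0, 1],
    "milk":     [1, 0, 0, 0, 1, 0],
    "aquatic":  [0, 0, 1, 1, 0, 0],
})

# n_clusters=7 in the original study (7 animal classes); 2 suits this toy set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)    # cluster index assigned to each animal
print(kmeans.inertia_)   # within-cluster sum of squared distances
```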
Here, the overall cluster inertia comes out to 119.70392382759556. This value is stored in the kmeans.inertia_ attribute.
To perform EDA, we need to reduce the dimensionality of our multivariate data to bivariate/trivariate (2D/3D) data. We can achieve this using PCA (Principal Component Analysis).
For more information refer to https://www.geeksforgeeks.org/dimensionality-reduction/
PCA can be carried out using the PCA class from sklearn.decomposition as follows:
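A minimal sketch of the PCA step, using a random binary matrix as a stand-in for the zoo feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the zoo feature matrix (20 animals x 10 binary features).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 10)).astype(float)

# Reduce to 3 components so the result can be shown in a 3D scatter plot.
pca = PCA(n_components=3)
reduced = pca.fit_transform(X)
print(reduced.shape)                   # one 3D point per animal
print(pca.explained_variance_ratio_)   # variance captured per component
```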
The output above is the reduced trivariate (3D) data on which we can perform EDA.
Note: The reduced data produced by PCA can be used indirectly for performing various analyses but is not directly human-interpretable.
A scatter plot is a 2D/3D plot that is helpful for analysing clusters in 2D/3D data. The scatter plot of the 3D reduced data we produced earlier can be plotted as follows:
The code below generates an array of colors (with approximately as many colors as there are clusters), sorted by their hue, saturation and value. Each color is associated with a single cluster and will be used to denote an animal as a 3D point when plotting it in the 3D scatter plot.
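A sketch of that color-generation step, based on matplotlib's standard recipe for sorting its named CSS colors by HSV; the choice of 7 clusters mirrors the zoo dataset's 7 animal classes.

```python
import matplotlib.colors as mcolors

# Sort matplotlib's named CSS colors by (hue, saturation, value).
by_hsv = sorted(
    (tuple(mcolors.rgb_to_hsv(mcolors.to_rgb(color))), name)
    for name, color in mcolors.CSS4_COLORS.items()
)
sorted_names = [name for hsv, name in by_hsv]

# Pick roughly one color per cluster, evenly spaced along the sorted list.
n_clusters = 7
step = len(sorted_names) // n_clusters
cluster_colors = sorted_names[::step][:n_clusters]
print(cluster_colors)
```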
The code below generates a 3D scatter plot where each data point is colored according to its corresponding cluster.
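A hedged sketch of the 3D scatter plot, assuming matplotlib; random arrays stand in for the PCA output and the KMeans labels, and the palette is a placeholder for the HSV-sorted colors described above.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the PCA-reduced data and the KMeans cluster labels.
rng = np.random.default_rng(0)
reduced = rng.normal(size=(20, 3))
labels = rng.integers(0, 3, size=20)
palette = ["red", "green", "blue"]   # one color per cluster (placeholder)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for cluster, color in enumerate(palette):
    pts = reduced[labels == cluster]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=color, label=f"cluster {cluster}")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
ax.legend()
fig.savefig("clusters_3d.png")
```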
Closely analysing the scatter plot leads to the hypothesis that the clusters formed from the initial data do not have enough explanatory power. To solve this, we need to narrow our set of features down to a more useful subset from which we can generate useful clusters. One way of producing such a subset is to carry out correlation analysis, which can be done by plotting heatmaps and trisurface plots as follows:
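A minimal sketch of the correlation heatmap, assuming matplotlib and pandas; random 0/1 data stands in for the zoo feature columns, whose names are assumptions here.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random 0/1 stand-in for a handful of zoo feature columns.
rng = np.random.default_rng(1)
zoo = pd.DataFrame(rng.integers(0, 2, size=(30, 6)),
                   columns=["hair", "feathers", "eggs", "milk", "toothed", "aquatic"])

corr = zoo.corr()                # pairwise feature correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.savefig("corr_heatmap.png")
```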
The following code generates a trisurface plot of the correlation matrix by building a list of tuples, where each tuple holds the positions of a pair of features and their correlation value.
Pseudocode for above explanation:
# PseudoCode tuple -> (position_in_dataframe(feature1), position_in_dataframe(feature2), correlation(feature1, feature2))
Code for generating the trisurface plot of the correlation matrix:
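A hedged sketch of the trisurface step, following the pseudocode above; matplotlib's plot_trisurf is assumed, and random 0/1 data again stands in for the zoo features.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random 0/1 stand-in for the zoo feature columns.
rng = np.random.default_rng(2)
zoo = pd.DataFrame(rng.integers(0, 2, size=(30, 6)),
                   columns=["hair", "feathers", "eggs", "milk", "toothed", "aquatic"])
corr = zoo.corr()

# Per the pseudocode: (position(feature1), position(feature2), correlation).
points = [(i, j, corr.iloc[i, j])
          for i in range(len(corr)) for j in range(len(corr))]
x, y, z = zip(*points)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_trisurf(x, y, z, cmap="viridis")
ax.set_xlabel("feature index")
ax.set_ylabel("feature index")
ax.set_zlabel("correlation")
fig.savefig("corr_trisurf.png")
```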
Using the heatmap and trisurface plot, we can infer how to select a smaller set of features for the cluster analysis. In general, feature pairs with extreme correlation values (strongly positive or strongly negative) carry high explanatory power and can be used for further analysis.
In this case, looking at both plots, we arrive at a rational list of 7 features: ["milk", "eggs", "hair", "toothed", "feathers", "breathes", "aquatic"].
Running the cluster analysis again on this reduced feature set, we can generate a scatter plot that gives better insight into how the different animals spread across the various groups.
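A sketch of that re-clustering step on the 7 selected features. The data here is a random 0/1 stand-in, so its inertia will not reproduce the article's value; the point is the pattern of subsetting the columns before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random 0/1 stand-in for the zoo data restricted to the chosen features.
cols = ["milk", "eggs", "hair", "toothed", "feathers", "breathes", "aquatic"]
rng = np.random.default_rng(3)
zoo = pd.DataFrame(rng.integers(0, 2, size=(40, 7)), columns=cols)

# Cluster on the reduced feature subset only.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(zoo[cols])
print(kmeans.inertia_)   # expected to drop relative to the full feature set
```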
We observe a reduced overall inertia of 14.479670329670329, which is indeed a lot lower than the initial inertia.