# Hierarchical Clustering in R Programming

**Hierarchical clustering** in the R programming language is an unsupervised learning method that groups observations into clusters arranged in a hierarchy (a nested ordering). For example, consider a family spanning three generations: grandparents have children, who in turn become the parents of their own children. All of them belong to the same family, and together they form a hierarchy.

## R – Hierarchical Clustering

Hierarchical clustering is of two types:

- **Agglomerative hierarchical clustering:** It starts with each observation as its own leaf and successively merges clusters. It is a bottom-up approach.
- **Divisive hierarchical clustering:** It starts at the root, with all observations in one cluster, and recursively splits the clusters. It is a top-down approach.

### Theory:

In hierarchical clustering, objects are organized into a tree-shaped hierarchy, which is then used to interpret the clustering. The agglomerative algorithm is as follows:

- Treat each data point as a single-point cluster, which gives **N** clusters.
- Merge the two closest data points into one cluster, which leaves **N-1** clusters.
- Merge the two closest clusters into one cluster, which leaves **N-2** clusters.
- Repeat the previous step until only one cluster remains.
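The steps above can be sketched as a naive loop in base R. This is only an illustration under stated assumptions: the toy values in `x`, the helper `avg_dist()`, and the choice of average linkage as the cluster-to-cluster distance are all introduced here and are not part of the original walkthrough.

```r
# Toy data: five points on a line (hypothetical values, for illustration only)
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)
d <- as.matrix(dist(x))

# Step 1: every point starts as its own cluster, giving N clusters
clusters <- as.list(names(x))
merge_heights <- numeric(0)

# Average-linkage distance: mean of all pairwise distances between two clusters
avg_dist <- function(c1, c2) mean(d[c1, c2])

# Remaining steps: repeatedly merge the two closest clusters
while (length(clusters) > 1) {
  pairs <- combn(length(clusters), 2)
  dists <- apply(pairs, 2, function(p) {
    avg_dist(clusters[[p[1]]], clusters[[p[2]]])
  })
  best <- pairs[, which.min(dists)]
  merge_heights <- c(merge_heights, min(dists))
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))
}

# Distances at which the merges happened
merge_heights

# The built-in function reports the same merge heights
hclust(dist(x), method = "average")$height
```

The loop recomputes every pairwise cluster distance on each pass, so it is far slower than `hclust()` and is meant only to make the merging order visible.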

A dendrogram is a tree diagram of the cluster hierarchy in which the distance at which clusters merge is drawn as height. It shows how **n** units (objects), each described by **p** features, are grouped step by step. The leaves at the bottom represent individual units, and units or clusters that are merged are joined by a horizontal line at the height of the merge. It provides a visual representation of the clusters.

**Thumb Rule:** Find the largest vertical distance that is not crossed by any horizontal line and cut the dendrogram there; the number of vertical lines crossed by the cut is the suggested number of clusters.
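The thumb rule can also be applied numerically by looking for the largest gap between consecutive merge heights. The sketch below assumes average linkage and uses hypothetical toy values with three visible groups; the names `gaps`, `k`, and `cut_height` are introduced here for illustration.

```r
# Toy data with three well-separated groups (hypothetical values)
x <- c(1, 2, 3, 20, 21, 22, 40, 41)
hc <- hclust(dist(x), method = "average")

h <- sort(hc$height)   # merge heights in increasing order
n <- length(h) + 1     # number of observations
gaps <- diff(h)        # vertical distance between consecutive merges

i <- which.max(gaps)   # the largest gap lies between merge i and merge i + 1
k <- n - i             # clusters remaining after the first i merges
cut_height <- (h[i] + h[i + 1]) / 2

k
cutree(hc, h = cut_height)
```

Cutting anywhere inside the largest gap gives the same grouping; the midpoint is used here only as a convenient choice. The rule is a heuristic, so the suggested number of clusters should still be checked against the dendrogram.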

## The Dataset

**mtcars** (Motor Trend Car Road Tests) comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. It ships with the datasets package that is part of base R, so no extra package needs to be installed.

```r
# mtcars is part of the built-in datasets package
data(mtcars)

# First few rows of the dataset
head(mtcars)
```

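Since the clustering in the next section relies on Euclidean distances, it can be worth checking how differently the variables are scaled. The following is a minimal sketch and not part of the original walkthrough; the names `spread` and `mtcars_scaled` are introduced here only for illustration.

```r
# Standard deviation of each variable: disp and hp vary on a much
# larger numeric scale than the other columns
spread <- sapply(mtcars, sd)
round(spread, 2)

# Standardise each column to mean 0 and standard deviation 1
mtcars_scaled <- scale(mtcars)

# Every column of the scaled data now has a standard deviation of 1
round(apply(mtcars_scaled, 2, sd), 2)
```

Passing `mtcars_scaled` to `dist()` instead of `mtcars` would give every variable a comparable weight in the distance. Whether that is preferable depends on the analysis; the walkthrough below uses the unscaled data.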

## Performing Hierarchical clustering on Dataset

The clustering is performed with **hclust()**, which is part of the stats package that is installed and loaded with R by default.

```r
# Computing the distance matrix
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat

# Fitting the hierarchical clustering model
# (hclust() is deterministic, so no random seed is needed)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

# Plotting the dendrogram
plot(Hierar_cl)

# Choosing the number of clusters
# Marking a cut height on the dendrogram
abline(h = 110, col = "green")

# Cutting the tree by number of clusters
fit <- cutree(Hierar_cl, k = 3)
fit

table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")
```

**Output:**

**Distance matrix:**

- The pairwise distances between the 32 cars are printed, computed with the Euclidean method.

**Model Hierar_cl:**

- The printed model reports the cluster method (average), the distance (euclidean), and the number of objects (32).

**Plot dendrogram:**

- The dendrogram is plotted with the car names as leaves along the x-axis and the merge height on the y-axis.

**Cut tree:**

- The tree is cut into k = 3 clusters. **fit** gives the cluster number assigned to each car, and **table(fit)** counts the cars in each cluster.

**Plotting dendrogram after cutting:**

- The dendrogram is shown after cutting. The green horizontal line marks the height h = 110, and the green rectangles drawn by **rect.hclust()** outline the three clusters.
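To interpret the clusters, it can help to list their members and compare per-cluster averages. The sketch below refits the same model so that it runs on its own; the names `hc`, `profile`, and `fit_h` are introduced here for illustration. It also shows that `cutree()` accepts a height through `h`, whereas `abline()` in the code above only draws a line.

```r
# Refit the model from the walkthrough
hc  <- hclust(dist(mtcars, method = "euclidean"), method = "average")
fit <- cutree(hc, k = 3)

# Cars that fall in each cluster
split(rownames(mtcars), fit)

# Mean of every variable within each cluster
profile <- aggregate(mtcars, by = list(cluster = fit), FUN = mean)
profile

# Cutting by height instead of by number of clusters
fit_h <- cutree(hc, h = 110)

# Cross-tabulate the two ways of cutting the same tree
table(by_k = fit, by_height = fit_h)
```

Cutting at a height yields one more cluster than the number of merges that occur above that height, so the two approaches agree only when the chosen height happens to correspond to the chosen k.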