How to Create a Distance Matrix in R?

A distance matrix is a matrix that contains the distance between each pair of elements in a dataset. In R Programming Language, there are several functions available for creating a distance matrix such as dist(), daisy(), and vegdist() from the stats, cluster, and vegan packages respectively.

  • Distance metrics: A distance metric is a function that defines the distance between two elements in a dataset. Common choices include Euclidean distance, Manhattan distance, and cosine distance (one minus cosine similarity).
  • Categorical variables: Categorical variables take one of a limited set of values. Their differences cannot be computed by subtraction, so they are usually compared as a match or a mismatch, for example with the Gower or Jaccard distance.
  • Missing data: Missing values (NA in R) are values that are not available for certain elements in a dataset. For the Euclidean, Manhattan, and Canberra methods, dist() leaves the missing coordinates out of each pairwise calculation and scales the sum up in proportion to the number of columns actually used.
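The missing-data behaviour of dist() can be checked on a tiny case. The following is a minimal sketch; the two points in pts are made up for illustration and are not part of the examples in this article.

```r
# Two points in 2-D; the second coordinate of the second point is missing
pts <- rbind(c(1, 2),
             c(4, NA))

# dist() skips the column containing NA and rescales the remaining
# sum of squared differences before taking the square root
d_na <- dist(pts)
print(d_na)

# Manual equivalent: only 1 of the 2 columns is usable, so the squared
# difference is multiplied by 2 / 1
manual <- sqrt((4 - 1)^2 * 2 / 1)
print(manual)
```

Both values should agree. If this rescaling is not wanted, rows with missing values can be removed or imputed before calling dist().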

Euclidean Distance Matrix

Euclidean distance is a measure of the straight-line distance between two points in Euclidean space. The mathematical formula for Euclidean distance is as follows:

d\left ( p,q \right )=\sqrt{\left ( p_1-q_1 \right )^2+\left ( p_2-q_2 \right )^2+\cdots +\left ( p_n-q_n \right )^2}

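Where d(p, q) is the Euclidean distance between two points p and q, and p1, p2, . . . , pn and q1, q2, . . . , qn are the coordinates of the points in n-dimensional space. In two-dimensional space, the Euclidean distance between two points (x1, y1) and (x2, y2) reduces to:

d\left ( p,q \right )=\sqrt{\left ( x_1-x_2 \right )^2+\left ( y_1-y_2 \right )^2}

The following example computes the Euclidean distance between every pair of rows of a 4 x 2 matrix. Because matrix() fills its values column by column, the rows are (1, 5), (2, 6), (3, 7), and (4, 8).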

R

# Create a matrix of data
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)
  
# Create an Euclidean distance matrix
euclidean_matrix <- dist(data)
  
# Print the distance matrix
print(euclidean_matrix)

                    

Output:

         1        2        3
2 1.414214                  
3 2.828427 1.414214         
4 4.242641 2.828427 1.414214

This output is a distance matrix where the value at i, j is the Euclidean distance between rows i and j of the data matrix. The dist() function calculates the Euclidean distance by default, so you don’t need to specify the method as “euclidean” explicitly.
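One entry of the distance matrix can be checked against the formula by hand. The sketch below rebuilds the same data matrix, computes the distance between rows 1 and 4 directly, and converts the dist object to a full square matrix with as.matrix(); the variable names manual_d14 and full_matrix are chosen here for illustration.

```r
# Same data as in the example: rows are (1, 5), (2, 6), (3, 7), (4, 8)
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)

# Euclidean distance between rows 1 and 4, straight from the formula
p <- data[1, ]
q <- data[4, ]
manual_d14 <- sqrt(sum((p - q)^2))

# dist() stores only the lower triangle; as.matrix() expands it
# to a full symmetric matrix with zeros on the diagonal
full_matrix <- as.matrix(dist(data))

print(manual_d14)
print(full_matrix[1, 4])
```

The full matrix form is convenient when individual entries need to be looked up by row and column index.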

Gower Distance Matrix

The Gower distance is a dissimilarity measure that can be used to compare observations that have mixed types of attributes (e.g., categorical and numerical). It is often used for clustering and classification tasks when working with datasets that have both continuous and categorical variables.

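d\left ( x,y \right )=\frac{\sum_{i=1}^{p}w_i \ast d_i\left ( x_i,y_i \right )}{\sum_{i=1}^{p}w_i}

Where d(x, y) is the Gower distance between observations x and y, p is the number of attributes, wi is the weight for the ith attribute, and di(xi, yi) is the contribution of the ith attribute to the distance. When every weight is 1, this is simply the average of the p contributions. The contribution depends on the type of attribute: for a numeric attribute it is |xi - yi| divided by the range of that attribute, and for a categorical attribute it is 0 if the two values match and 1 otherwise.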

R

# create a sample data frame with mixed data types
df <- data.frame(
  numeric_var = c(1, 2, 3, 4),
  categorical_var = c("A", "B", "A", "C"),
  binary_var = c(0, 1, 0, 1)
)
  
# range (max - min) of the numeric variable, used to
# scale numeric differences to the interval [0, 1]
range_numeric_var <- diff(range(df$numeric_var))
  
# initialize the Gower distance matrix
n <- nrow(df)
gower_distance_matrix <- matrix(0, nrow = n, ncol = n)
  
# calculate the Gower distance between each pair of rows
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    gower_distance_matrix[i, j] <- (
      # numeric: absolute difference divided by the range
      abs(df$numeric_var[i] - df$numeric_var[j]) / range_numeric_var +
      # categorical: 0 if the levels match, 1 otherwise
      as.numeric(df$categorical_var[i] != df$categorical_var[j]) +
      # binary: absolute difference of the 0/1 values
      abs(df$binary_var[i] - df$binary_var[j])
    ) / 3
  }
}
  
# output
print(gower_distance_matrix)

                    

Output:

          [,1]      [,2]      [,3]      [,4]
[1,] 0.0000000 0.7777778 0.2222222 1.0000000
[2,] 0.7777778 0.0000000 0.7777778 0.5555556
[3,] 0.2222222 0.7777778 0.0000000 0.7777778
[4,] 1.0000000 0.5555556 0.7777778 0.0000000

In this example, the numeric variable contributes the absolute difference between the two values divided by the range (maximum minus minimum) of the variable, which puts its contribution on a 0 to 1 scale. The categorical variable contributes 0 when the two rows share the same level and 1 otherwise, and the binary variable contributes the absolute difference of its 0/1 values.

Finally, the Gower distance between each pair of rows is the average of the three contributions, so every entry lies between 0 and 1. For example, rows 1 and 3 agree on both the categorical and the binary variable and differ by 2/3 on the scaled numeric variable, giving (2/3 + 0 + 0) / 3 = 0.2222222. The resulting matrix can be used to compare how dissimilar each pair of rows in the data frame is.
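In practice the Gower distance is usually computed with daisy() from the cluster package mentioned in the introduction. The following is a minimal sketch, assuming the cluster package is installed. Storing the categorical and binary columns as factors is a choice made here so that daisy() compares them as match or mismatch.

```r
library(cluster)  # provides daisy()

# Same sample data, with the non-numeric columns stored as factors
df <- data.frame(
  numeric_var = c(1, 2, 3, 4),
  categorical_var = factor(c("A", "B", "A", "C")),
  binary_var = factor(c(0, 1, 0, 1))
)

# metric = "gower" scales numeric columns by their range and
# treats factor columns as match (0) or mismatch (1)
gower_daisy <- daisy(df, metric = "gower")

# Convert the dissimilarity object to a full square matrix
gower_full <- as.matrix(gower_daisy)
print(round(gower_full, 4))
```

Every entry of gower_full lies between 0 and 1, with 0 on the diagonal. The dissimilarity object returned by daisy() can also be passed directly to clustering functions such as hclust().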

Jaccard Distance Matrix

Jaccard distance is derived from the Jaccard similarity coefficient, a measure of similarity between two sets. It is widely used in fields such as natural language processing, information retrieval, and bioinformatics. The mathematical formula for the Jaccard similarity coefficient is defined as:

J(A,B) = |A ∩ B| / |A ∪ B|

Where J(A, B) is the Jaccard similarity coefficient between sets A and B, |A ∩ B| is the cardinality of the intersection of A and B, and |A ∪ B| is the cardinality of their union. To get the Jaccard distance, we subtract the Jaccard similarity from 1.

Jaccard Distance = 1 - Jaccard Similarity Coefficient

Jaccard distance is a metric that ranges from 0 to 1, with a value of 0 indicating that the sets are identical and a value of 1 indicating that the sets have no elements in common.

R

# create a sample data frame with mixed data types
df <- data.frame(
  numeric_var = c(1,2,3,4),
  categorical_var = c("A", "B", "A", "C"),
  binary_var = c(0,1,0,1)
)
  
# convert categorical variable to a 0/1 indicator
# (1 if the level is "A", 0 for any other level)
df$categorical_var <- ifelse(df$categorical_var == "A", 1, 0)
  
# make sure the binary variable is stored as numeric 0/1
df$binary_var <- as.numeric(df$binary_var)
  
# calculate the Jaccard distance 
# matrix using the dist function
jaccard_distance_matrix <- dist(df,
                                method = "binary")
  
#Print the distance matrix
print(jaccard_distance_matrix)

                    

Output:

          1         2         3
2 0.6666667                    
3 0.0000000 0.6666667          
4 0.6666667 0.0000000 0.6666667

In this example, the categorical variable is reduced to a 0/1 indicator (1 for level A, 0 for any other level) and the binary variable is kept as numeric 0/1. The dist() function with method = "binary" treats every non-zero value as on and every zero as off, so numeric_var, which is non-zero in all rows, counts as present in every row. The distance between two rows is the proportion of columns in which only one of the rows is on, among the columns in which at least one of them is on, which is the Jaccard distance of the two rows viewed as sets. Rows 1 and 3 have the same on/off pattern, so their distance is 0. The resulting matrix can be used for further analysis or visualization.
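The set formula for the Jaccard distance can also be applied directly to two sets. The sketch below uses a hypothetical helper, jaccard_distance(), and two made-up sets. It then encodes the same sets as 0/1 indicator rows to show that dist() with method = "binary" gives the same value.

```r
# Hypothetical helper: Jaccard distance between two sets given as vectors
jaccard_distance <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)  # convention chosen here for two empty sets
  1 - length(intersect(a, b)) / union_size
}

set_a <- c("apple", "banana", "cherry")
set_b <- c("banana", "cherry", "date", "fig")

# Intersection has 2 elements and union has 5, so the distance is 1 - 2/5
print(jaccard_distance(set_a, set_b))

# Same sets as indicator rows: one column per item in the union
items <- union(set_a, set_b)
indicators <- rbind(a = as.numeric(items %in% set_a),
                    b = as.numeric(items %in% set_b))
print(dist(indicators, method = "binary"))
```

Both calls give 0.6, which illustrates why method = "binary" is the Jaccard distance for 0/1 data.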

Manhattan Distance Matrix

Manhattan distance, also known as L1 distance, is a measure of the distance between two points in a Euclidean space. It is calculated as the sum of the absolute differences between their coordinates. Unlike Euclidean distance, which measures the “straight-line” distance between two points, Manhattan distance measures the distance between two points as if you were navigating a grid-like city, always moving only horizontally or vertically, like a taxi in Manhattan.

The mathematical formula for Manhattan distance between two points (x1, y1) and (x2, y2) in a 2-dimensional space is:

d(p,q) = |x1 - x2| + |y1 - y2|

And for n-dimensional space the formula is:

d\left ( p,q \right )=\sum_{i=1}^{n}\left| q_i-p_i\right|

Where d(p, q) is the Manhattan distance between two points p and q, and pi and qi are the ith coordinates of the points in n-dimensional space.
 

R

# Create a matrix of data
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)
  
# Create a Manhattan distance matrix
manhattan_matrix <- dist(data,
                         method = "manhattan")
  
# Print the distance matrix
print(manhattan_matrix)

                    

Output:

  1 2 3
2 2    
3 4 2  
4 6 4 2
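Each entry is the sum of absolute coordinate differences between two rows. As a quick check, the sketch below recomputes the distance between rows 1 and 4 by hand; the names manual_d14 and dist_d14 are chosen here for illustration.

```r
# Same data as in the example: rows are (1, 5), (2, 6), (3, 7), (4, 8)
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)

# Manhattan distance between rows 1 and 4: |1 - 4| + |5 - 8|
manual_d14 <- sum(abs(data[1, ] - data[4, ]))

# The same entry taken from dist()
dist_d14 <- as.matrix(dist(data, method = "manhattan"))[1, 4]

print(manual_d14)
print(dist_d14)
```

Both values equal 6, matching the entry in row 4, column 1 of the output.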

Canberra Distance Matrix

Canberra distance is a measure of dissimilarity between two points in a multidimensional space. It is a weighted version of the Manhattan distance: for each coordinate, the absolute difference between the two points is divided by the sum of the absolute values of the two coordinates, and these ratios are summed, as follows:

d\left ( x,y \right )=\sum_{i=1}^{n}\frac{\left| x_i-y_i\right|}{\left| x_i\right|+\left| y_i\right|}

Where d(x, y) is the Canberra distance between two points x and y, n is the number of dimensions in the space, and xi and yi are the coordinates of the two points in the ith dimension. Because every term is a ratio between 0 and 1, the measure is less sensitive to the scale of the data than Euclidean distance, and a single large coordinate difference cannot dominate the total as it can with the Manhattan distance. It is most often applied to non-negative data such as counts, and it is very sensitive to small changes when both coordinates are close to zero.

R

# Create a matrix of data
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)
  
# Create a Canberra distance matrix
canberra_matrix <- dist(data,
                        method = "canberra")
  
# Print the distance matrix
print(canberra_matrix)

                    

Output:

          1         2         3
2 0.4242424                    
3 0.6666667 0.2769231          
4 0.8307692 0.4761905 0.2095238
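The first entry of this output can be reproduced from the formula. The sketch below uses rows 1 and 2 of the same matrix; the names x, y, manual_d12, and dist_d12 are chosen here for illustration.

```r
# Same data as in the example: rows are (1, 5), (2, 6), (3, 7), (4, 8)
data <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8),
               nrow = 4, ncol = 2)

x <- data[1, ]  # (1, 5)
y <- data[2, ]  # (2, 6)

# Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|)
# here 1/3 + 1/11
manual_d12 <- sum(abs(x - y) / (abs(x) + abs(y)))

# The same entry taken from dist()
dist_d12 <- as.matrix(dist(data, method = "canberra"))[1, 2]

print(manual_d12)
print(dist_d12)
```

Both values are approximately 0.4242424. Note that each of the two terms is below 1, so the distance between any two rows of this two-column matrix cannot exceed 2.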


Last Updated : 05 Feb, 2023