How to Calculate Jaccard Similarity in R?

Last Updated : 04 Jan, 2022

Jaccard Similarity, also called the Jaccard Index or Jaccard Coefficient, is a simple measure of the similarity between two data samples. It is computed as the ratio of the size of the intersection of the samples to the size of their union.

It is represented as – 

J(A, B) = |A ∩ B| / |A ∪ B|

It is used to find the similarity or overlap between two binary vectors, two numeric vectors, or two sets of strings, and is usually denoted J. A closely related measure is the Jaccard Dissimilarity, or Jaccard Distance, which measures how dissimilar two data samples are. It is computed as (1 - J), where J is the Jaccard Similarity.

Common Applications of Jaccard Similarity:

Jaccard Similarity is used in many data science and machine learning applications. Common real-world use cases include:

  • Text mining: finding the similarity between two text documents based on the number of terms used in both documents.
  • E-Commerce: finding similar customers via their purchase history from a sales database of thousands of customers and millions of items.
  • Recommendation Systems: finding similar customers based on ratings and reviews, e.g., movie recommendation algorithms, product recommendations, diet recommendations, matrimony recommendations, etc.

Jaccard Similarity Formula and Concepts:

Jaccard Similarity ranges from 0 to 1. The higher the value, the more similar the two datasets are. Although it is easy to interpret, it is extremely sensitive to small sample sizes and can give misleading results, so results computed on small datasets should be interpreted with care.
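
To make the small-sample caveat concrete, here is a minimal sketch in base R. The helper name jaccard_sets and the example vectors are illustrative assumptions, not part of the original examples. It changes a single element in a three-element set and in a thirty-element set and compares the resulting similarities.

```r
# Illustrative helper: Jaccard similarity of two vectors treated as sets
jaccard_sets <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Small sets: identical except for one element
small_a <- c(1, 2, 3)
small_b <- c(1, 2, 4)

# Larger sets: identical except for one element
large_a <- 1:30
large_b <- c(1:29, 31)

jaccard_sets(small_a, small_b)  # 2 shared / 4 unique = 0.5
jaccard_sets(large_a, large_b)  # 29 shared / 31 unique, about 0.935
```

A single differing element halves the similarity of the small sets but barely moves it for the larger ones.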

Jaccard Similarity for Numeric Sets:

Jaccard Similarity (J) = (count of common elements in both sets) / (count of elements in first set + count of elements in second set - count of common elements in both sets)

where (count of elements in first set + count of elements in second set - count of common elements in both sets) = count of unique elements across both sets.

Considering A and B as two sets, it can be represented in symbolic form as

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)
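
One caveat about the |A| + |B| - |A ∩ B| form: it equals |A ∪ B| only when A and B are true sets with no repeated elements. The following sketch, using made-up vectors x and y, shows how a duplicated value inflates the count when raw vector lengths are used.

```r
# Hypothetical vectors; x contains a repeated value
x <- c(1, 2, 2, 3)
y <- c(2, 3, 4)

n_common <- length(intersect(x, y))  # common elements {2, 3} -> 2
n_union  <- length(union(x, y))      # unique elements {1, 2, 3, 4} -> 4

# Inclusion-exclusion on the raw lengths over-counts the duplicate
raw_count <- length(x) + length(y) - n_common                  # 4 + 3 - 2 = 5

# De-duplicating first restores the identity
set_count <- length(unique(x)) + length(unique(y)) - n_common  # 3 + 3 - 2 = 4

n_common / n_union  # Jaccard similarity = 0.5
```

Using union() directly, or calling unique() on the inputs first, avoids the problem.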

Example:

In this example, let A and B be two sets, where

Set A = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50} and

Set B = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

A ∩ B = {10, 20, 30, 40, 50}, i.e., |A ∩ B| = 5

A ∪ B = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100}, i.e., |A ∪ B| = 15

Therefore, J(A, B) = |A ∩ B| / |A ∪ B| = 5 / 15 = 0.33333. The same calculation in the R programming language:

R




# Set A - numeric vector    
SetA <- c(5,10,15,20,25,30,35,40,45,50)
  
# Set B - numeric vector
SetB <- c(10,20,30,40,50,60,70,80,90,100)
  
# Function for computing Jaccard Similarity
jaccard_similarity <- function(A, B) {
  # intersect() and union() treat their inputs as sets (duplicates are dropped)
  intersection_size <- length(intersect(A, B))
  union_size <- length(union(A, B))
  return(intersection_size / union_size)
}
  
# Jaccard Similarity between sets, A and B
Jaccard_Similarity <- jaccard_similarity(SetA,SetB)
Jaccard_Similarity
  
# Jaccard Dissimilarity/Distance between sets, A and B 
Jaccard_Distance <- 1 - Jaccard_Similarity
Jaccard_Distance


Output

[1] 0.3333333
[1] 0.6666667

Jaccard Similarity for Binary Sets:

Considering A and B as two binary vectors,

Jaccard Similarity (J) = (number of observations which are 1 in both the vectors) / (number of observations which are 1 in both the vectors + number of observations which are 0 for A and 1 for B + number of observations which are 1 for A and 0 for B)

The symbolic form becomes 

 J(A, B) =  a_11 / (a_11 + b_01 + c_10)

Where

  • a_11 = number of observations that are 1 in both vectors
  • b_01 = number of observations that are 0 in A and 1 in B
  • c_10 = number of observations that are 1 in A and 0 in B
  • d_00 = number of observations that are 0 in both vectors (not required for computing Jaccard Similarity)
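
The counts above translate directly into base R. The following is a minimal sketch, not a package function: the helper name jaccard_binary and the vectors v1 and v2 are made up for illustration, and the inputs are assumed to be 0/1 vectors of equal length.

```r
# Illustrative helper: Jaccard similarity of two 0/1 vectors of equal length
jaccard_binary <- function(a, b) {
  stopifnot(length(a) == length(b))
  a11 <- sum(a == 1 & b == 1)  # 1 in both vectors
  b01 <- sum(a == 0 & b == 1)  # 0 in A, 1 in B
  c10 <- sum(a == 1 & b == 0)  # 1 in A, 0 in B
  a11 / (a11 + b01 + c10)
}

# Made-up vectors for demonstration
v1 <- c(1, 1, 0, 0, 1)
v2 <- c(1, 0, 1, 0, 1)

jaccard_binary(v1, v2)      # a11 = 2, b01 = 1, c10 = 1, so 2 / 4 = 0.5
1 - jaccard_binary(v1, v2)  # Jaccard distance = 0.5
```

Note that if both vectors contain only zeros, the denominator is 0 and this sketch returns NaN; how to treat that case is a design decision left to the caller.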

Example:

Consider a grocery store selling multiple products, where the shop owner wants to measure the similarity between two customers based on the purchases they have made. Here, 1 indicates that the customer purchased the product and 0 indicates that the customer did not purchase it.

 

           Product1  Product2  Product3  Product4  Product5  Product6  Product7  Product8  Product9  Product10
Customer1  0         1         0         0         0         1         0         0         1         1
Customer2  0         0         1         0         0         0         0         0         1         1

R




# Load the qvalue and jaccard packages
# (install both first if they are not already available)
library(qvalue)
library(jaccard)
  
# Binary vectors A and B depicting purchase of 
# items by customers
Binary_A <- c(0,1,0,0,0,1,0,0,1,1)
Binary_B <- c(0,0,1,0,0,0,0,0,1,1)
  
# Computing jaccard similarity between 2 binary 
# vectors A and B
jaccard(Binary_A,Binary_B)
  
# Computing jaccard distance between 2 binary 
# vectors A and B
Jaccard_distance <- 1 - jaccard(Binary_A,Binary_B)
Jaccard_distance


Output

[1] 0.4
[1] 0.6

Jaccard Similarity for Sets with Strings:

Jaccard Similarity (J) = (number of unique strings present in both sets) / (number of unique strings present in either set)

Considering A and B as two sets, it can be represented in symbolic form as

J(A, B) = |A ∩ B| / |A ∪ B|

Example :

Let A and B be two sets of strings where

Set A = {'John', 'is', 'going', 'to', 'the', 'market', 'today', 'to', 'buy', 'cake'} and

Set B = {'Tim', 'is', 'at', 'the', 'shop', 'already', 'for', 'buying', 'two', 'cakes'}

Find Jaccard Similarity between the two sets.

R




# Install packages "bayesbio" and "stringdist" and load the libraries
library(bayesbio)    # provides jaccardSets()
library(stringdist)  # provides stringdist()

# Two character vectors "String_A" and "String_B" treated as sets of words
String_A <- c("John", "is", "going", "to", "the",
              "market", "today", "to", "buy", "cake")
String_B <- c("Tim", "is", "at", "the", "shop",
              "already", "for", "buying", "two", "cakes")

# Computing Jaccard distance between each pair of corresponding
# words, based on the characters in each word (q-grams, q = 1 by default)
# Note - this is a distance: 0 means the two words use exactly the
# same characters and 1 means they share none
stringdist(String_A, String_B, method = "jaccard")

# Computing Jaccard similarity between the two sets of words overall
jaccardSets(String_A, String_B)

# Computing Jaccard distance
jaccard_distance <- 1 - jaccardSets(String_A, String_B)
jaccard_distance


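Treating the two vectors as sets, they share 2 words ('is' and 'the') out of 17 unique words ('to' appears twice in Set A but counts once), so the overall Jaccard Similarity is 2 / 17 ≈ 0.1176 and the Jaccard Distance is ≈ 0.8824.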
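
The overall set similarity for this example can also be computed without extra packages. The following base R sketch assumes the words are compared as exact, case-sensitive strings. The names words_a, words_b and jaccard_sets are illustrative and not part of any package.

```r
# The same two word vectors as in the example above
words_a <- c("John", "is", "going", "to", "the",
             "market", "today", "to", "buy", "cake")
words_b <- c("Tim", "is", "at", "the", "shop",
             "already", "for", "buying", "two", "cakes")

# Illustrative helper: Jaccard similarity of two vectors treated as sets
jaccard_sets <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

shared <- intersect(words_a, words_b)         # words present in both vectors
similarity <- jaccard_sets(words_a, words_b)  # 2 shared / 17 unique words
distance <- 1 - similarity

shared
similarity
distance
```

Because exact matching is used, 'buy' and 'buying' or 'cake' and 'cakes' count as different words; stemming or lower-casing the words beforehand would change the result.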


