
How to Sample a Large Database and Implement K-Means and KNN in R

Last Updated : 02 Feb, 2024

Sampling a large database becomes essential to reduce the amount of data so that predictions can be made faster and more reliably. The goal is to extract a representative subset of data from a larger dataset for analysis, since it might be impractical or time-consuming to analyze the entire dataset. R is an open-source programming language and software environment for data analysis and statistical computing. In this article, we will learn how to sample a large dataset and implement machine learning algorithms like K-Nearest Neighbors (KNN) for classification and K-Means for clustering, using the R programming language.

What is Sampling?

Sampling is the process of selecting a subset (known as a sample) from a larger population (the bigger dataset) so that the large database can be analyzed based on the characteristics observed within the sample. It is a fundamental method in statistics and data analysis, especially when analyzing large datasets where working with the entire dataset is impractical and time-consuming. Three major techniques are used to sample large databases (a short R sketch of each follows the list):

  1. Random Sampling: Every individual or element in the population has an equal chance of being included in the sample. With R’s dplyr package, you can use the sample_n() function.
  2. Stratified Sampling: The population is divided into distinct subgroups based on certain characteristics, and random samples are then drawn independently from each subgroup.
  3. Systematic Sampling: Every nth item is selected from the population after a random starting point is chosen.
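
Below is a minimal R sketch of all three techniques. Here df and region are placeholder names used purely for illustration; they stand for any large data frame and any grouping column.

R

library(dplyr)

set.seed(42)  # for reproducibility

# 1. Random sampling: draw 100 rows, each row equally likely
random_sample <- df %>% sample_n(size = 100)

# 2. Stratified sampling: draw 25 rows independently from each subgroup
#    (assumes every subgroup has at least 25 rows)
stratified_sample <- df %>%
  group_by(region) %>%
  sample_n(size = 25) %>%
  ungroup()

# 3. Systematic sampling: every 10th row after a random starting point
start <- sample(1:10, 1)
systematic_sample <- df[seq(from = start, to = nrow(df), by = 10), ]
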

Steps to Sample large Dataset

  1. Identify the Population Set: Before beginning the sampling process, clearly define the objective and select the most suitable population set from which the sample will be drawn.
  2. Select a Suitable Method: Based on the stated purpose and characteristics of the dataset, determine which sampling method will best suit your needs. This may include simple random sampling, stratified sampling, or systematic sampling.
  3. Determine the Optimal Sample Size: Consider both statistical requirements and desired precision when determining the appropriate sample size for your project.
  4. Implement the Sampling Method: Extract your sample from the large dataset using relevant functions in your preferred data analysis tool, or by writing SQL queries.
  5. Validate the Sample: To ensure the sample accurately represents the larger population, thoroughly check key characteristics (a short sketch of this follows the list). Perform any necessary analyses using the sample data if required.
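
As a rough illustration of step 5, the sketch below compares a few summary statistics between the full data and the sample. It assumes df is the full dataset and sampled_df is the subset drawn from it (as created later in this article); charges and region are placeholder column names.

R

# Compare a numeric column: similar means and standard deviations
# suggest the sample is representative of the population
c(population_mean = mean(df$charges), sample_mean = mean(sampled_df$charges))
c(population_sd = sd(df$charges), sample_sd = sd(sampled_df$charges))

# Compare the distribution of a categorical column
round(prop.table(table(df$region)), 3)
round(prop.table(table(sampled_df$region)), 3)
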

Implementation of K-Means in R

K-Means is an unsupervised machine learning algorithm used for clustering. It is used to find intrinsic groups within an unlabeled dataset and draw inferences from them, and it belongs to the family of centroid-based clustering methods.

Centroid – A centroid is the data point at the centre of a cluster. In centroid-based clustering, each cluster is represented by its centroid, and the notion of similarity is derived from how close a data point is to that centroid. The K-Means algorithm uses an iterative procedure to deliver the final result: it takes the number of clusters K and the dataset (a collection of features for each data point) as input, and starts with initial estimates for the K centroids.

It partitions a dataset into K clusters based on the similarity of data points, where K is a predefined number. K-means works with an unlabeled dataset and aims to group data points into clusters such that points within a cluster are more similar to each other than to points in other clusters.
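
To make the iterative procedure concrete, here is a bare-bones K-Means sketch written in plain R. It is for illustration only (empty clusters are not handled); in practice we use the built-in kmeans() function shown below.

R

simple_kmeans <- function(X, k, iterations = 10) {
  X <- as.matrix(X)
  # Initial estimates: k randomly chosen data points act as centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (it in seq_len(iterations)) {
    # Assignment step: each point goes to its nearest centroid (squared Euclidean distance)
    dists <- sapply(seq_len(k), function(j) {
      rowSums((X - matrix(centroids[j, ], nrow(X), ncol(X), byrow = TRUE))^2)
    })
    cluster <- max.col(-dists)  # column index of the smallest distance per row
    # Update step: each centroid moves to the mean of its assigned points
    centroids <- t(sapply(seq_len(k), function(j) {
      colMeans(X[cluster == j, , drop = FALSE])
    }))
  }
  list(cluster = cluster, centers = centroids)
}
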

Install and Load Packages

  • dplyr: The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allows efficient manipulation of data frames and tibbles.
  • cluster: The cluster package provides classical methods for cluster analysis (such as PAM, CLARA, and AGNES) along with visualization helpers like clusplot(), which we use below to plot the fitted clusters.
  • ggfortify: ggfortify is an R package that enhances the visualization of statistical models using ggplot2. It provides an interface for producing plots for various statistical models, making it easier to explore and interpret the results.

R




# Installing Packages
install.packages(c("dplyr", "cluster", "ggfortify"))

# Loading Packages
library(dplyr)      # data manipulation (sample_n)
library(cluster)    # clusplot() for cluster visualization
library(ggfortify)  # autoplot() for kmeans objects


Load Dataset

Here we are using Kaggle’s insurance dataset. It has the columns “age, sex, bmi, children, smoker, region, charges”. First we define the path to the file, then we read the file and store it in a variable.

Dataset Link: US Health Insurance Dataset.

R




file_path <- file.path("C:/Users/subha/Downloads", "ushealth.csv")
df <- read.csv(file_path)
head(df)


Output:

  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

Sample the Dataset

The dataset contains 1339 rows; here we will use sampling to learn how it can be applied to a large dataset. Let’s sample the data down to just 100 rows.

Syntax:

sample_n(tbl, size, replace = FALSE, weight = NULL)

tbl: Data frame (or tibble) to sample from.

size: The number of rows to sample.

replace: Logical. If TRUE, sampling is done with replacement; if FALSE, sampling is done without replacement.

weight: Optional vector of sampling weights for the rows. If not specified, all rows are equally likely to be selected.

R




set.seed(123)  # Set seed for reproducibility
 
# Data Sampling
sampled_df <- df %>%
  sample_n(size = 100, replace = FALSE)


Model Fitting

R




# Choosing only relevant columns
df_1 <- sampled_df[, c("age", "bmi")]
 
# Fitting K-Means Clustering Model
set.seed(240)  # Setting seed for reproducibility
kmeans.re <- kmeans(df_1, centers = 3, nstart = 20)


set.seed(240): Sets the random seed to 240. This ensures that if we run the same code multiple times, we will get the same results

  • kmeans(): This function is used to perform K-Means clustering.
  • df_1: The dataset on which clustering is performed.
  • centers = 3: Specifies the number of clusters you want to form.
  • nstart = 20: The number of times the algorithm is run with different initial cluster centers. The final result is the best solution obtained across all runs.
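
Once the model is fitted, the returned object can be inspected directly; for example, the cluster centres and the within-cluster sum of squares give a quick feel for the result.

R

# Coordinates of the three cluster centres (age, bmi)
kmeans.re$centers

# Total within-cluster sum of squares (lower means tighter clusters)
kmeans.re$tot.withinss

# Cluster assigned to each sampled row
head(kmeans.re$cluster)
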

Cluster Table

R




# Cluster sizes (a confusion matrix is not applicable for unsupervised K-Means)
cm <- table(kmeans.re$cluster)
print("Clusters:")
print(cm)


Output:

[1] "Clusters:"

1 2 3
42 28 30

The above table shows how the data points are assigned to the clusters: cluster 1 has 42 data points, cluster 2 has 28, and cluster 3 has 30.

Model Visualization

R




# Model Visualization
clusplot(df_1, kmeans.re$cluster, lines = 0, shade = TRUE, color = TRUE,
         labels = 2, plotchar = FALSE, span = TRUE, main = "Cluster data",
         xlab = 'Age', ylab = 'BMI')


Output:

[Cluster plot of the sampled data: Age vs. BMI, three clusters]

We can observe that 3 clusters have been plotted and cluster 2 has minimal overlap with cluster 3.

R




# Visualization using autoplot
autoplot(stats::kmeans(df_1, centers = 3), data = df_1)


Output:

[autoplot of the K-Means clusters over the two features]

Three clusters are shown in the above plot, where orange-colored points belong to cluster 1, green-colored points to cluster 2, and the remaining points to cluster 3.

Implementation of KNN in R

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm. It makes no assumptions about the underlying data or its distribution. It is one of the simplest and most widely used algorithms, and its behavior depends on the value of k (the number of neighbors considered). It can be used for both classification and regression, predicting a target variable from one or more independent variables. KNN stores all the available data and classifies a new data point based on its similarity to the stored points, so when new data appears it can easily be placed into a suitable category. The algorithm is also known as a “lazy learner” because it only stores the data in the training phase and defers computation until prediction time. KNN has applications in the healthcare industry, the finance sector, and more.
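
The idea can be illustrated in a few lines of plain R: for a single new point, compute the distance to every stored training point, take the k nearest, and let them vote. This is only a rough sketch of the principle, not what the class package does internally.

R

# Minimal KNN for one test point: Euclidean distance + majority vote
knn_one <- function(train_x, train_labels, new_x, k = 5) {
  distances <- sqrt(rowSums((as.matrix(train_x) -
                             matrix(new_x, nrow(train_x), ncol(train_x), byrow = TRUE))^2))
  nearest <- order(distances)[1:k]          # indices of the k closest training points
  votes <- table(train_labels[nearest])     # count the neighbours' class labels
  names(which.max(votes))                   # predicted class
}

Once the training features and labels are prepared below, this could be called, for example, as knn_one(wdbc_train_features, wdbc_train_labels, unlist(wdbc_test_features[1, ]), k = 12).
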

Install and Load Packages

The libraries we will be using for this KNN implementation are:

  • dplyr: The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allows efficient manipulation of data frames and tibbles.
  • ggplot2: ggplot2 is an open-source data visualization package used in R to create graphics declaratively.
  • class: This package provides functions for classification, including k-nearest neighbour, Learning Vector Quantization, and Self-Organizing Maps.

R




# Installing Packages
install.packages(c("dplyr", "ggplot2", "class"))

# Loading Packages
library(dplyr)    # data manipulation (sample_n)
library(ggplot2)  # plotting accuracy vs. k later on
library(class)    # knn() classifier


Load Dataset and explore

Here we are using the Breast Cancer Wisconsin (Diagnostic) dataset, which is freely available on the internet.

Dataset Link: Breast Cancer Wisconsin (Diagnostic)

R




file_path <- file.path("C:/Users/subha/Downloads", "wdbc.csv") #File path
wdbc <- read.csv(file_path) #reading the dataset
head(wdbc,2)


Output:

   X842302 M X17.99 X10.38 X122.8 X1001 X0.1184 X0.2776 X0.3001 X0.1471 X0.2419 X0.07871
1 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667
2 84300903 M 19.69 21.25 130.0 1203 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999
X1.095 X0.9053 X8.589 X153.4 X0.006399 X0.04904 X0.05373 X0.01587 X0.03003 X0.006193
1 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532
2 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571
X25.38 X17.33 X184.6 X2019 X0.1622 X0.6656 X0.7119 X0.2654 X0.4601 X0.1189
1 24.99 23.41 158.8 1956 0.1238 0.1866 0.2416 0.186 0.2750 0.08902
2 23.57 25.53 152.5 1709 0.1444 0.4245 0.4504 0.243 0.3613 0.08758

The dataset consists of 568 rows and 32 columns.

If you look carefully, the second column (the second variable) holds categorical values and is our target variable. Let’s exclude the first column (the ID) from the dataset to simplify the analysis.

R




# Remove the first (ID) column
wdbc <- wdbc[,-1]


Sample the Dataset

The dataset contains 568 rows; here we will use sampling to learn how it can be applied to a large dataset. Let’s sample the data down to 200 rows.

Syntax:

sample_n(tbl, size, replace = FALSE, weight = NULL)

tbl: Data frame (or tibble) to sample from.

size: The number of rows to sample.

replace: Logical. If TRUE, sampling is done with replacement; if FALSE, sampling is done without replacement.

weight: Optional vector of sampling weights for the rows. If not specified, all rows are equally likely to be selected.

R




set.seed(123)  # Set seed for reproducibility
 
# Data Sampling
wdbc_sampled <- wdbc%>%
  sample_n(size = 200, replace = FALSE)


Data Normalization

Let’s check how the data values vary in our dataset, looking at columns 2 to 5.

R




summary(wdbc_sampled[,2:5])


Output:


     X17.99           X10.38          X122.8           X1001       
 Min.   : 8.219   Min.   : 9.71   Min.   : 53.27   Min.   : 203.9  
 1st Qu.:11.920   1st Qu.:16.33   1st Qu.: 76.39   1st Qu.: 432.4  
 Median :13.340   Median :18.53   Median : 85.74   Median : 546.4  
 Mean   :14.137   Mean   :19.14   Mean   : 91.89   Mean   : 653.1  
 3rd Qu.:15.520   3rd Qu.:21.53   3rd Qu.:102.53   3rd Qu.: 748.2  
 Max.   :28.110   Max.   :33.81   Max.   :188.50   Max.   :2499.0  

We can see that there is a huge difference between the minimum and maximum values in almost every column, so let’s normalize the values.

R




# Min-max normalization: rescales a numeric vector to the range [0, 1]
data_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Normalize every feature column (the categorical label in column 1 is excluded)
wdbc_norm <- as.data.frame(lapply(wdbc_sampled[, -1], data_norm))
summary(wdbc_norm[,2:5])


Output:

     X10.38           X122.8            X1001            X0.1184      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.2746   1st Qu.:0.1709   1st Qu.:0.09954   1st Qu.:0.3272  
 Median :0.3660   Median :0.2401   Median :0.14921   Median :0.4549  
 Mean   :0.3913   Mean   :0.2856   Mean   :0.19572   Mean   :0.4648  
 3rd Qu.:0.4905   3rd Qu.:0.3642   3rd Qu.:0.23715   3rd Qu.:0.5859  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  

We have normalized every feature column; the categorical target variable in the first column of wdbc_sampled was excluded, since it does not need normalization and will be used separately as the class label.

Now all feature values lie in the range 0 to 1.
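
A quick sanity check confirms that the normalized values now span exactly 0 to 1.

R

# Overall minimum and maximum across all normalized columns; should print 0 and 1
range(unlist(wdbc_norm))
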

Split the dataset in train and test data

The sampled dataset is split into training and testing sets by randomly assigning roughly 70% of the rows to training and 30% to testing, then creating separate training and testing data frames from those indices. The resulting training set (wdbc_train) contains about 70% of the data, and the testing set (wdbc_test) the remaining 30%. This split is used in machine learning to assess model performance on unseen data.

R




set.seed(1234)
ind <- sample(2, nrow(wdbc_norm), replace = TRUE, prob = c(.7, .3))
wdbc_train <- wdbc_norm[ind == 1, ]
wdbc_test <- wdbc_norm[ind == 2, ]

# The class labels are in the first column of the sampled (un-normalized) data
wdbc_train_labels <- wdbc_sampled[ind == 1, 1]
wdbc_test_labels <- wdbc_sampled[ind == 2, 1]

# wdbc_norm already excludes the label, so every column is a feature
wdbc_train_features <- wdbc_train
wdbc_test_features <- wdbc_test


KNN Model Training

Here we are using the “class” library’s knn() function. We use a rule of thumb to choose the value of K: with 145 rows in the training data, the square root of 145 is approximately 12, so we take K = 12.
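
The same rule of thumb can also be computed rather than hard-coded, for example:

R

# Rule of thumb: start with k near the square root of the training-set size
k_guess <- round(sqrt(nrow(wdbc_train_features)))
k_guess  # roughly 12 for about 145 training rows
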

R




k <- 12
wdbc_pred <- knn(train = wdbc_train_features,
                 test = wdbc_test_features,
                 cl = wdbc_train_labels,
                 k = k)


Model Performance Analysis

R




#Confusion matrix
confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
print(confusion_matrix)
 
#Accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")


Output:

      Predicted
Actual  B  M
     B 32  1
     M  3 19

Accuracy: 0.9272727

To evaluate the model’s performance, we created a confusion matrix to check the correct classifications and misclassifications.

  • 32 instances with actual class “B” were predicted as “B” and 1 was predicted as “M”, while 3 instances with actual class “M” were predicted as “B” and 19 were predicted as “M”.
  • This gives a model accuracy of about 92.7%, which is very good.

Determine optimal K value and visualize

Though we have a model accuracy of 92.7%, we should check other K values to see whether one of them gives a more accurate result. Let’s check K values in the range 1 to 20 to find which one performs best.

R




# Range for K values
k_values <- 1:20
# To store accuracy values for different K values
accuracy_values <- numeric(length(k_values))
 
for (i in 1:length(k_values)) {
  wdbc_pred <- knn(train = wdbc_train_features,
                   test = wdbc_test_features,
                   cl = wdbc_train_labels,
                   k = k_values[i])
   
  confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
  accuracy_values[i] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}
 
# Create a data frame for ggplot
accuracy_df <- data.frame(k = k_values, accuracy = accuracy_values)
 
# Plot accuracy for different k values using ggplot2
library(ggplot2)
 
ggplot(accuracy_df, aes(x = k, y = accuracy)) +
  geom_point(color = "blue", size = 3) +
  geom_line(color = "blue") +
  labs(title = "Accuracy for Different k Values",
       x = "k",
       y = "Accuracy") +
  theme_minimal()
 
# Identify the optimal k value
optimal_k <- k_values[which.max(accuracy_values)]
cat("Optimal k value:", optimal_k, "\n")


Output:


Plot of Accuracy v/s K value

As we can see from the plot above, some other K values give better results, and the optimal one here is 2.

Optimal k value: 2 

So, with k = 2, we check the 2 nearest neighbors of a data point to predict which category it belongs to.

Conclusion

Applying data sampling before K-Nearest Neighbors (KNN) classification and K-Means clustering makes these machine learning algorithms far more practical on large datasets, greatly reducing computation time. By systematically selecting a representative subset, we limit the bias a poorly chosen subset would introduce and help the models generalize, while also addressing the computational challenges of working with the full dataset.


