# How to Sample a Large Database and Implement K-Means and KNN in R

Last Updated : 02 Feb, 2024

Sampling a large database is essential for reducing the amount of data so that analysis and prediction run faster. The goal is to extract a representative subset from a larger dataset, because analyzing the entire dataset may be impractical or time-consuming. R is an open-source programming language and software environment for data analysis and statistical computing. In this article, we will learn how to sample a large dataset and implement machine learning algorithms such as K-Nearest Neighbors (KNN) for classification and K-means for clustering, using the R programming language.

## What is Sampling?

Sampling is the process of selecting a subset, known as a sample, from a larger population (the bigger dataset) so that the population can be analyzed through the characteristics observed within the sample. It is a fundamental method in statistics and data analysis, especially for large datasets where analyzing every record is impractical or too time-consuming. Three major techniques are used to sample large databases:

1. Random Sampling: Here every individual or element in the population has an equal chance of being included in the sample. In R’s dplyr package, you can use the sample_n() function for this.
2. Stratified Sampling: In this, the population is divided into distinct subgroups (strata) based on certain characteristics. Then, random samples are taken independently from each subgroup.
3. Systematic Sampling: This type of sampling involves selecting every nth item from the population after randomly selecting a starting point.
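
The three techniques can be sketched in base R as follows. This is a minimal illustration rather than a definitive recipe: it assumes the built-in iris data as a stand-in population, a sample size of 30, and Species as the stratification variable, none of which come from the datasets used later in this article.

```r
# Illustrative population: the built-in iris data (150 rows, 3 species of 50 rows each)
set.seed(42)
population <- iris
n <- 30

# 1. Simple random sampling: every row has the same chance of selection
random_sample <- population[sample(nrow(population), n), ]

# 2. Stratified sampling: draw the same number of rows from each Species
per_group <- n / length(unique(population$Species))  # 10 rows per species
strata <- split(seq_len(nrow(population)), population$Species)
stratified_idx <- unlist(lapply(strata, function(idx) sample(idx, per_group)))
stratified_sample <- population[stratified_idx, ]

# 3. Systematic sampling: random starting point, then every step-th row
step <- nrow(population) %/% n        # 150 %/% 30 = 5
start <- sample(step, 1)              # random start between 1 and step
systematic_idx <- seq(from = start, by = step, length.out = n)
systematic_sample <- population[systematic_idx, ]

table(stratified_sample$Species)      # 10 rows from each species
```

Stratified sampling is the one to reach for when a subgroup is small enough that a simple random sample might miss it.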

### Steps to Sample large Dataset

1. Identify the Population Set: Before beginning the sampling process, clearly define the objective and select the most suitable population set from which the sample will be drawn.
2. Select a Suitable Method: Based on the stated purpose and characteristics of the dataset, determine which sampling method will best suit your needs. This may include simple random sampling, stratified sampling, or systematic sampling.
3. Determine the Optimal Sample Size: Consider both statistical requirements and desired precision when determining the appropriate sample size for your project.
4. Implement the Sampling Method: Extract your sample from the large dataset using relevant functions in your preferred data analysis tool, or by writing SQL queries.
5. Validate the Sample: To ensure the sample accurately represents the larger population, thoroughly check key characteristics. Perform any necessary analyses using the sample data if required.
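
Step 5 can be as simple as comparing summary statistics of the sample with those of the population. The sketch below assumes a simulated population of 100,000 BMI-like values (hypothetical data, not the insurance dataset used later) and a sample of 1,000 rows.

```r
set.seed(1)
# Hypothetical "large" population: 100,000 simulated BMI-like values
population <- data.frame(bmi = rnorm(100000, mean = 30, sd = 6))

# Draw a simple random sample of 1,000 rows
sampled <- population[sample(nrow(population), 1000), , drop = FALSE]

# Compare key summary statistics of the population and the sample
comparison <- rbind(
  population = c(mean = mean(population$bmi), sd = sd(population$bmi)),
  sample     = c(mean = mean(sampled$bmi),    sd = sd(sampled$bmi))
)
print(round(comparison, 2))
```

If the two rows differ noticeably for a column that matters to the analysis, a larger sample or a stratified design is worth considering.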

### Implementation of K-Means in R

K-means is an unsupervised machine learning algorithm used for clustering. K-Means clustering is used to find intrinsic groups within the unlabeled dataset and draw inferences from them. It is based on centroid-based clustering.

Centroid – A centroid is a data point at the centre of a cluster. In centroid-based clustering, each cluster is represented by its centroid, and similarity is measured by how close a data point is to that centroid.

K-Means clustering works as follows: the algorithm takes the number of clusters K and the dataset (a collection of features for each data point) as input and starts with initial estimates for the K centroids. It then iterates, assigning each point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the assignments stop changing.

In this way it partitions an unlabeled dataset into K clusters, where K is a predefined number, such that points within a cluster are more similar to each other than to points in other clusters.
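
To make the iterative procedure concrete, here is a minimal from-scratch sketch of the assignment and update steps. The function name simple_kmeans and the toy data are hypothetical and only for illustration; the rest of this article uses R's built-in kmeans() function instead.

```r
# Minimal sketch of the K-means loop; illustrative only
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-dists, ties.method = "first")
    if (all(new_assignment == assignment)) break  # assignments stable: converged
    assignment <- new_assignment
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(10)
# Two well-separated hypothetical groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```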

The libraries we will be using for this K-means implementation are:

• dplyr: The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allows efficient manipulation of data frames and tibbles.
• cluster: The cluster package provides cluster analysis methods such as k-medoids (PAM) and hierarchical clustering, together with the clusplot() function for plotting a clustering.
• ggfortify: ggfortify is an R package that enhances the visualization of statistical models using ggplot2. This package provides an interface for producing plots for various statistical models, making it easier to explore and interpret the results.

```r
# Installing packages (cluster provides clusplot(), which is used for plotting)
install.packages(c("dplyr", "cluster", "ggfortify"))

# Loading packages
library(dplyr)
library(cluster)
library(ggfortify)
```

Here we are using Kaggle’s insurance dataset. It has the columns age, sex, bmi, children, smoker, region and charges. First we define the path to the file, then we read the file and store it in a variable.

Dataset Link: US Health Insurance Dataset.

```r
file_path <- file.path("C:/Users/subha/Downloads", "ushealth.csv")
df <- read.csv(file_path)
head(df)
```

Output:

```
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
```

### Sample the Dataset

The dataset contains 1339 rows. Here we use sampling to learn how it can be applied to a large dataset. Let’s sample the data down to 100 rows.

Syntax:

sample_n(tbl, size, replace = FALSE, weight = NULL)

tbl: Data frame to sample from.

size: The number of rows to draw.

replace: Logical. If TRUE, sampling is done with replacement. If FALSE, sampling is done without replacement.

weight: Optional sampling weights, one per row. If not specified, all rows are equally likely to be selected.

```r
set.seed(123)  # Set seed for reproducibility

# Data Sampling
sampled_df <- df %>%
  sample_n(size = 100, replace = FALSE)
```

```r
# Choosing only relevant columns
df_1 <- sampled_df[, c("age", "bmi")]

# Fitting K-Means Clustering Model
set.seed(240)  # Setting seed for reproducibility
kmeans.re <- kmeans(df_1, centers = 3, nstart = 20)
```

• set.seed(240): Sets the random seed to 240. This ensures that if we run the same code multiple times, we will get the same results.
• kmeans(): This function is used to perform K-Means clustering.
• df_1: The dataset on which clustering is performed.
• centers = 3: Specifies the number of clusters you want to form.
• nstart = 20: The number of times the algorithm is run with different initial cluster centers. The final result is the best solution obtained across all runs.
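
The object returned by kmeans() carries more than the cluster labels. The sketch below shows a few of its components. It uses two columns of the built-in iris data as a stand-in (an assumption for illustration, since the insurance CSV may not be available), with the same centers and nstart settings as above.

```r
set.seed(240)
# Stand-in data: two numeric columns from the built-in iris dataset
toy <- iris[, c("Sepal.Length", "Petal.Length")]
fit <- kmeans(toy, centers = 3, nstart = 20)

fit$centers        # one row per cluster centroid
fit$size           # number of points in each cluster
fit$tot.withinss   # total within-cluster sum of squares (lower means tighter clusters)
head(fit$cluster)  # cluster label assigned to the first few rows
```

When the columns are on very different scales, it may be worth applying scale() first, because K-means uses Euclidean distance and the column with the larger range would otherwise dominate.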

```r
# Cluster sizes (a confusion matrix does not apply to K-Means, which has no true labels)
cm <- table(kmeans.re$cluster)
print("Clusters:")
print(cm)
```

Output:

```
[1] "Clusters:"

 1  2  3 
42 28 30 
```

The above table shows how data points are assigned to the clusters: cluster 1 has 42 data points, cluster 2 has 28, and cluster 3 has 30.

```r
# Model Visualization
clusplot(df_1, kmeans.re$cluster, lines = 0, shade = TRUE, color = TRUE,
         labels = 2, plotchar = FALSE, span = TRUE, main = "Cluster data",
         xlab = 'Age', ylab = 'BMI')
```

Output:

We can observe that 3 clusters have been plotted and cluster 2 has minimal overlap with cluster 3.

```r
# Visualization using autoplot
# Note: this refits K-Means without a seed, so cluster numbering may differ from kmeans.re
autoplot(stats::kmeans(df_1, centers = 3), data = df_1)
```

Output:

Three clusters are shown in the above plot, where orange points belong to cluster 1, green points to cluster 2, and the rest to cluster 3.

### Implementation of KNN in R

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. KNN doesn’t make any assumption about the underlying data or its distribution. It is one of the simplest and most widely used algorithms, and its behavior depends on the value of k (the number of neighbors). It predicts a target variable using one or more independent variables. The KNN algorithm stores all the available data and classifies a new data point based on its similarity to the stored points, so when new data appears it can be assigned to the best-fitting category. The algorithm is also known as a “lazy learner” because it only stores the data in the training phase and defers all computation until prediction time. KNN is applied in the healthcare industry, the finance sector, and other fields.
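
The idea of classifying by similarity can be sketched in a few lines. The function simple_knn and the toy data below are hypothetical and only illustrate the mechanics for a single query point (Euclidean distance plus a majority vote); the implementation later in this article uses knn() from the class package.

```r
# Minimal sketch of KNN classification for a single query point; illustrative only
simple_knn <- function(train_x, train_y, query, k) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (a tie goes to the label that comes first in table order)
  names(which.max(table(nearest)))
}

# Hypothetical toy data: two features, two classes
train_x <- data.frame(x1 = c(1, 1.2, 0.8, 5, 5.3, 4.9),
                      x2 = c(1, 0.9, 1.1, 5, 4.8, 5.2))
train_y <- c("B", "B", "B", "M", "M", "M")

simple_knn(train_x, train_y, query = c(1.1, 1.0), k = 3)  # returns "B"
simple_knn(train_x, train_y, query = c(5.1, 5.0), k = 3)  # returns "M"
```

Nothing is computed until a query arrives, which is what the term “lazy learner” refers to.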

The libraries we will be using for this KNN implementation are:

• dplyr: The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allows efficient manipulation of data frames and tibbles.
• ggplot2: ggplot2 is an open-source data visualization package used in R to create graphics declaratively.
• class: The class package provides functions for classification, including k-nearest neighbour, Learning Vector Quantization and Self-Organizing Maps.

```r
# Installing packages
install.packages(c("dplyr", "ggplot2", "class"))

# Loading packages
library(dplyr)
library(ggplot2)
library(class)  # provides knn()
```

Here we are using the Breast Cancer Wisconsin (Diagnostic) dataset, which is freely available on the internet.

Dataset Link: Breast Cancer Wisconsin (Diagnostic)

```r
file_path <- file.path("C:/Users/subha/Downloads", "wdbc.csv")  # File path
wdbc <- read.csv(file_path)  # Reading the dataset
head(wdbc, 2)
```

Output:

```
   X842302 M X17.99 X10.38 X122.8 X1001 X0.1184 X0.2776 X0.3001 X0.1471 X0.2419 X0.07871
1   842517 M  20.57  17.77  132.9  1326 0.08474 0.07864  0.0869 0.07017  0.1812  0.05667
2 84300903 M  19.69  21.25  130.0  1203 0.10960 0.15990  0.1974 0.12790  0.2069  0.05999
  X1.095 X0.9053 X8.589 X153.4 X0.006399 X0.04904 X0.05373 X0.01587 X0.03003 X0.006193
1 0.5435  0.7339  3.398  74.08  0.005225  0.01308  0.01860  0.01340  0.01389  0.003532
2 0.7456  0.7869  4.585  94.03  0.006150  0.04006  0.03832  0.02058  0.02250  0.004571
  X25.38 X17.33 X184.6 X2019 X0.1622 X0.6656 X0.7119 X0.2654 X0.4601 X0.1189
1  24.99  23.41  158.8  1956  0.1238  0.1866  0.2416   0.186  0.2750 0.08902
2  23.57  25.53  152.5  1709  0.1444  0.4245  0.4504   0.243  0.3613 0.08758
```

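The dataset consists of 568 rows and 32 columns. The column names (X842302, M, X17.99, and so on) are actually the values of the first record: the file has no header row, so read.csv() has used its first line as the header. Passing header = FALSE to read.csv() would keep that record as data.

The second column holds categorical values (M or B) and is our target variable. The first column is an ID that carries no predictive information, so let’s exclude it from the dataset.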

```r
# Removes the first column
wdbc <- wdbc[, -1]
```

### Sample the Dataset

The dataset contains 568 rows. Here we use sampling to learn how it can be applied to a large dataset. Let’s sample the data down to 200 rows.

As in the K-means example, we use dplyr’s sample_n(tbl, size, replace = FALSE, weight = NULL), this time with size = 200 and replace = FALSE so that no row is drawn twice.

```r
set.seed(123)  # Set seed for reproducibility

# Data Sampling
wdbc_sampled <- wdbc %>%
  sample_n(size = 200, replace = FALSE)
```

### Data Normalization

Let’s check how the values vary in our dataset by looking at columns 2 to 5.

```r
summary(wdbc_sampled[, 2:5])
```

Output:

```
     X17.99           X10.38          X122.8           X1001       
 Min.   : 8.219   Min.   : 9.71   Min.   : 53.27   Min.   : 203.9  
 1st Qu.:11.920   1st Qu.:16.33   1st Qu.: 76.39   1st Qu.: 432.4  
 Median :13.340   Median :18.53   Median : 85.74   Median : 546.4  
 Mean   :14.137   Mean   :19.14   Mean   : 91.89   Mean   : 653.1  
 3rd Qu.:15.520   3rd Qu.:21.53   3rd Qu.:102.53   3rd Qu.: 748.2  
 Max.   :28.110   Max.   :33.81   Max.   :188.50   Max.   :2499.0  
```

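We can see that the columns are on very different scales: X17.99 ranges from about 8 to 28, while X1001 ranges from about 204 to 2499. KNN relies on distances between points, so columns with larger ranges would dominate the result. Let’s normalize the values to a common range.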

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
data_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Normalize every column except the first (the class label)
wdbc_norm <- as.data.frame(lapply(wdbc_sampled[, -1], data_norm))
summary(wdbc_norm[, 2:5])
```

Output:

```
     X10.38           X122.8           X1001            X0.1184      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.2746   1st Qu.:0.1709   1st Qu.:0.09954   1st Qu.:0.3272  
 Median :0.3660   Median :0.2401   Median :0.14921   Median :0.4549  
 Mean   :0.3913   Mean   :0.2856   Mean   :0.19572   Mean   :0.4648  
 3rd Qu.:0.4905   3rd Qu.:0.3642   3rd Qu.:0.23715   3rd Qu.:0.5859  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
```

We have normalized every column except the first one, which holds the categorical class label and does not need to be normalized. Because wdbc_norm no longer contains the label column, its columns 2 to 5 correspond to columns 3 to 6 of the sampled data. All values now lie in the range 0 to 1.
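
The normalization used here is min-max scaling, x' = (x - min(x)) / (max(x) - min(x)). As a small worked example, the sketch below applies it to three values of the X1001 column taken from the first summary output (its minimum, median and maximum). The helper name min_max is only illustrative.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Minimum, median and maximum of X1001 as reported in the summary output
areas <- c(203.9, 546.4, 2499.0)
round(min_max(areas), 4)  # 0.0000 0.1492 1.0000
```

The middle value lands close to the normalized median of 0.14921 shown in the output; the small difference comes from the rounding of the displayed summary values. Note that a constant column would make the denominator zero, so such columns need separate handling.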

### Split the dataset in train and test data

The sampled dataset is split into training and testing sets by randomly assigning each row to the training set with probability 0.7 and to the testing set with probability 0.3. The resulting training set (wdbc_train) therefore contains roughly 70% of the sampled rows, and the testing set (wdbc_test) contains the remaining rows. This kind of split is used in machine learning to assess model performance on unseen data.

```r
set.seed(1234)
# Assign each row to group 1 (train, p = 0.7) or group 2 (test, p = 0.3)
ind <- sample(2, nrow(wdbc_norm), replace = TRUE, prob = c(.7, .3))
wdbc_train <- wdbc_norm[ind == 1, ]
wdbc_test <- wdbc_norm[ind == 2, ]

# The class labels are in the first column of the sampled (unnormalized) data
wdbc_train_labels <- wdbc_sampled[ind == 1, 1]
wdbc_test_labels <- wdbc_sampled[ind == 2, 1]

# Note: wdbc_norm already excludes the label column, so 2:ncol() drops the
# first feature (X17.99); use wdbc_train and wdbc_test directly to keep all features
wdbc_train_features <- wdbc_train[, 2:ncol(wdbc_train)]
wdbc_test_features <- wdbc_test[, 2:ncol(wdbc_test)]
```
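
Because the split above assigns rows with a probability, the training share is only approximately 70%. If an exact 70/30 split is preferred, one alternative is to sample row indices without replacement. The sketch below assumes a simulated 200-row data frame named toy as a stand-in for wdbc_norm.

```r
set.seed(1234)
# Stand-in for the 200-row normalized sample: simulated features
toy <- data.frame(f1 = runif(200), f2 = runif(200))

n_train <- round(0.7 * nrow(toy))          # exactly 140 rows
train_idx <- sample(nrow(toy), n_train)    # row indices drawn without replacement
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))  # 140 and 60
```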

### KNN Model Training

Here we use the “class” library for the KNN model. To choose the value of K we apply a common rule of thumb: take the square root of the number of training rows. We have 145 rows in the training data, and the square root of 145 is approximately 12, so we take K = 12.

```r
k <- 12
wdbc_pred <- knn(train = wdbc_train_features,
                 test = wdbc_test_features,
                 cl = wdbc_train_labels,
                 k = k)
```
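
The rule-of-thumb value of K can also be computed rather than hard-coded. This small sketch assumes the 145 training rows reported above; with a different sample the result would change.

```r
n_train <- 145                   # number of training rows reported above
k_rule <- round(sqrt(n_train))   # sqrt(145) is about 12.04
k_rule                           # 12
```

For a two-class problem, some practitioners prefer an odd value of K so that a majority vote cannot end in a tie.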

```r
# Confusion matrix
confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
print(confusion_matrix)

# Accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
```

Output:

```
      Predicted
Actual  B  M
     B 32  1
     M  3 19
Accuracy: 0.9272727 
```

To evaluate the model’s performance we have created a confusion matrix, which shows the correct classifications and the misclassifications.

• Of the instances that are actually “B”, 32 were predicted as “B” and 1 as “M”. Of the instances that are actually “M”, 3 were predicted as “B” and 19 as “M”.
• This gives an accuracy of 92.7% (51 of 55 test instances classified correctly).
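
Accuracy is not the only number that can be read off the confusion matrix. The sketch below re-enters the four counts from the output above and derives sensitivity, specificity and precision. Treating “M” as the positive class is an assumption made here for illustration.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(32, 3, 1, 19), nrow = 2,
             dimnames = list(Actual = c("B", "M"), Predicted = c("B", "M")))

accuracy    <- sum(diag(cm)) / sum(cm)         # (32 + 19) / 55
sensitivity <- cm["M", "M"] / sum(cm["M", ])   # actual M predicted as M: 19 / 22
specificity <- cm["B", "B"] / sum(cm["B", ])   # actual B predicted as B: 32 / 33
precision   <- cm["M", "M"] / sum(cm[, "M"])   # predicted M that are actually M: 19 / 20

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
# accuracy 0.927, sensitivity 0.864, specificity 0.970, precision 0.950
```

Here the 3 “M” cases predicted as “B” lower the sensitivity to about 86%, which the overall accuracy alone does not reveal.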

### Determine optimal K value and visualize

Although we have a model accuracy of 92.7%, we should check whether another K value gives a more accurate result. Let’s try K values from 1 to 20 and see which one performs best.

```r
# Range for K values
k_values <- 1:20
# To store accuracy values for different K values
accuracy_values <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  wdbc_pred <- knn(train = wdbc_train_features,
                   test = wdbc_test_features,
                   cl = wdbc_train_labels,
                   k = k_values[i])

  confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
  accuracy_values[i] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}

# Create a data frame for ggplot
accuracy_df <- data.frame(k = k_values, accuracy = accuracy_values)

# Plot accuracy for different k values using ggplot2
library(ggplot2)

ggplot(accuracy_df, aes(x = k, y = accuracy)) +
  geom_point(color = "blue", size = 3) +
  geom_line(color = "blue") +
  labs(title = "Accuracy for Different k Values",
       x = "k",
       y = "Accuracy") +
  theme_minimal()

# Identify the optimal k value
optimal_k <- k_values[which.max(accuracy_values)]
cat("Optimal k value:", optimal_k, "\n")
```

Output:

Plot of Accuracy v/s K value

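As we can see from the above plot, some other K values give better results than K = 12, and the best one is 2.

```
Optimal k value: 2 
```

So, on this test split, looking at the 2 nearest neighbors of a data point gives the most accurate prediction of its category. Note that which.max() returns the first K with the highest accuracy, and that this value is tuned on a single small test set, so it may not carry over to other samples.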
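
Since a K chosen on one small test set can be unstable, one common alternative is leave-one-out cross-validation, which the class package offers through knn.cv(). The sketch below is an illustration under stated assumptions: it uses the built-in iris data (min-max scaled) as a stand-in, and the grid of odd K values is an arbitrary choice.

```r
library(class)
set.seed(2024)

# Stand-in data: iris features scaled to [0, 1], species as class labels
features <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
labels <- iris$Species

k_values <- seq(1, 19, by = 2)  # odd values only
cv_accuracy <- sapply(k_values, function(k) {
  # knn.cv() predicts each row from all the other rows (leave-one-out)
  pred <- knn.cv(train = features, cl = labels, k = k)
  mean(pred == labels)
})

data.frame(k = k_values, accuracy = round(cv_accuracy, 3))
```

Every row serves as a test case exactly once, so the estimate does not depend on which rows happened to land in the test set.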

### Conclusion

Applying data sampling before K-Nearest Neighbors (KNN) and K-means clustering makes these algorithms practical on a large dataset by reducing the computational cost. Provided the subset is selected so that it is representative of the full data, the patterns found in the sample can be expected to carry over to the whole population with limited bias.
