Sampling a large database is essential for reducing the amount of data to analyze, which makes prediction faster and more manageable. The goal is to extract a representative subset from a larger dataset, since analyzing the entire dataset may be impractical or time-consuming. R is an open-source programming language and software environment for data analysis and statistical computing. In this article, we will learn how to sample a large dataset and implement machine learning algorithms like K-Nearest Neighbors (KNN) for classification and K-means for clustering, using the R programming language.

## What is Sampling?

Sampling is the process of selecting a subset, known as a sample, from a larger population (the full dataset) so that the population can be analyzed based on the characteristics observed within the sample. It is a fundamental method in statistics and data analysis, especially for large datasets where analyzing every record is impractical or takes too long. Three major techniques are used to sample large databases:

- **Random Sampling:** Here every individual or element in the population has an equal chance of being included in the sample. In R's dplyr package, you can use the sample_n() function for this.
- **Stratified Sampling:** In this, the population is divided into distinct subgroups based on certain characteristics. Then, random samples are independently taken from each subgroup.
- **Systematic Sampling:** This type of sampling involves selecting every nth item from the population after randomly selecting a starting point.
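The three techniques can be sketched in base R as follows. This is a minimal illustration rather than a definitive implementation: the built-in `iris` data frame stands in for a large database, and the sample size of 30 and all variable names are assumptions chosen for the example.

```r
# Illustrative data: built-in iris (150 rows, 3 species with 50 rows each)
set.seed(42)
n <- 30  # desired sample size (assumed for this example)

# Random sampling: every row has the same chance of being selected
random_idx <- sample(nrow(iris), size = n, replace = FALSE)
random_sample <- iris[random_idx, ]

# Stratified sampling: draw 10 rows independently from each Species subgroup
rows_by_group <- split(seq_len(nrow(iris)), iris$Species)
strat_idx <- unlist(lapply(rows_by_group, function(rows) sample(rows, size = 10)))
stratified_sample <- iris[strat_idx, ]

# Systematic sampling: random starting point, then every k-th row
k <- nrow(iris) %/% n        # sampling interval
start <- sample(k, 1)        # random start between 1 and k
sys_idx <- seq(from = start, by = k, length.out = n)
systematic_sample <- iris[sys_idx, ]

table(stratified_sample$Species)
```

Stratified sampling guarantees that each subgroup appears in the sample in a fixed amount, which simple random sampling does not.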

### Steps to Sample a Large Dataset

- **Identify Population Set:** Before beginning the sampling process, clearly define the objective and select the most suitable population set from which the sample will be drawn.
- **Select a Suitable Method:** Based on the stated purpose and the characteristics of the dataset, determine which sampling method best suits your needs. This may include simple random sampling, stratified sampling, or systematic sampling.
- **Determine the Optimal Sample Size:** Consider both statistical requirements and desired precision when determining the appropriate sample size for your project.
- **Implement the Sampling Method:** Extract your sample from the large dataset using relevant functions in your preferred data analysis tool, or by writing SQL queries.
- **Validate the Sample:** To ensure the sample accurately represents the larger population, thoroughly check key characteristics. Perform any necessary analyses using the sample data if required.
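The validation step can be illustrated with a quick comparison of the sample against the population. The sketch below is only one possible check, again using the built-in `iris` data as a stand-in; the sample size of 60 and the compared columns are assumptions for illustration.

```r
set.seed(1)
population <- iris
sample_df <- population[sample(nrow(population), 60), ]

# Compare a numeric characteristic: mean of one column
pop_mean <- mean(population$Sepal.Length)
smp_mean <- mean(sample_df$Sepal.Length)
cat("Mean Sepal.Length (population, sample):", pop_mean, smp_mean, "\n")

# Compare a categorical characteristic: share of each group
pop_prop <- prop.table(table(population$Species))
smp_prop <- prop.table(table(sample_df$Species))
print(round(rbind(population = pop_prop, sample = smp_prop), 2))
```

If the sample's means and group proportions are far from the population's, a larger sample or a stratified design may be worth considering.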

### Implementation of K-Means in R

K-means is an unsupervised machine learning algorithm used for clustering. K-Means clustering is used to find intrinsic groups within the unlabeled dataset and draw inferences from them. It is based on centroid-based clustering.

**Centroid:** A centroid is the point at the centre of a cluster, computed as the mean of the points assigned to it. In centroid-based clustering, each cluster is represented by its centroid, and similarity is measured by how close a data point is to that centroid. K-Means works iteratively: it takes the number of clusters K and the data set (a collection of features for each data point) as input, starts with initial estimates for the K centroids, and then repeatedly assigns each point to its nearest centroid and recomputes the centroids until the assignments stop changing.

It partitions a dataset into K clusters based on the similarity of data points, where K is a predefined number. K-means works with an unlabeled dataset and aims to group data points into clusters such that points within a cluster are more similar to each other than to points in other clusters.
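To make the iterative procedure concrete, here is a minimal sketch of the assign-and-update loop in base R. The function name `simple_kmeans` and the toy data are hypothetical and meant only for illustration; in practice the built-in `kmeans()` function should be preferred.

```r
# Minimal K-means sketch (illustrative only; prefer stats::kmeans() in practice)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initial estimates: k distinct rows chosen at random
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance of every point to every centroid
    dists <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_assignment <- max.col(-dists, ties.method = "first")
    if (all(new_assignment == assignment)) break  # assignments stopped changing
    assignment <- new_assignment
    # Update step: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```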

### Install and Load Packages

- **dplyr:** The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allow efficient manipulation of data frames and tibbles.
- **cluster:** The cluster package provides cluster-analysis methods such as k-medoids (`pam()`) together with tools to plot and validate clusterings, including the `clusplot()` function used later in this article.
- **ggfortify:** ggfortify is an R package that enhances the visualization of statistical models using ggplot2. It provides an interface for producing plots for various statistical models, making it easier to explore and interpret the results.

```r
# Installing Packages
install.packages(c("dplyr", "cluster", "ggfortify"))

# Loading Packages
library(dplyr)
library(cluster)    # provides clusplot(), used for the cluster plot
library(ggfortify)
```

### Load Dataset

Here we are using Kaggle’s insurance dataset. It has the columns age, sex, bmi, children, smoker, region and charges. First we define the path to the file, then we read the file and store it in a variable.

**Dataset Link:** US Health Insurance Dataset.

```r
# Path to the downloaded CSV file (adjust to your own location)
file_path <- file.path("C:/Users/subha/Downloads", "ushealth.csv")
df <- read.csv(file_path)
head(df)
```

**Output:**

```
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
```

### Sample the Dataset

The dataset contains 1339 rows. Here we use sampling to learn how it can be applied to a large dataset. Let’s sample the data down to 100 rows using dplyr’s sample_n() function.

Syntax:

sample_n(tbl, size, replace = FALSE, weight = NULL)

tbl: Data frame to sample from.

size: The number of rows to draw.

replace: Logical. If TRUE, sampling is done with replacement. If FALSE, sampling is done without replacement.

weight: Sampling weights for the rows. If not specified, all rows are equally likely to be selected.

```r
set.seed(123)  # Set seed for reproducibility

# Data Sampling
sampled_df <- df %>%
  sample_n(size = 100, replace = FALSE)
```

### Model Fitting

```r
# Choosing only relevant columns
df_1 <- sampled_df[, c("age", "bmi")]

# Fitting K-Means Clustering Model
set.seed(240)  # Setting seed for reproducibility
kmeans.re <- kmeans(df_1, centers = 3, nstart = 20)
```

- **set.seed(240):** Sets the random seed to 240. This ensures that if we run the same code multiple times, we will get the same results.
- **kmeans():** This function is used to perform K-Means clustering.
- **df_1:** The dataset on which clustering is performed.
- **centers = 3:** Specifies the number of clusters you want to form.
- **nstart = 20:** The number of times the algorithm is run with different initial cluster centers. The final result is the best solution obtained across all runs.
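The value `centers = 3` is fixed in advance here. One common way to sanity-check such a choice is the elbow heuristic, sketched below: fit K-Means for several values of K and look at where the total within-cluster sum of squares stops dropping sharply. This check is not part of the original workflow; it uses the built-in `iris` data as a stand-in, and the range 1 to 6 is an assumption for illustration.

```r
set.seed(240)
# Stand-in data: two standardized numeric columns from iris
features <- scale(iris[, c("Sepal.Length", "Petal.Length")])

# Total within-cluster sum of squares for K = 1..6
k_range <- 1:6
wss <- sapply(k_range, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

print(data.frame(k = k_range, tot_withinss = round(wss, 2)))
```

The K after which the decrease levels off is a reasonable candidate for `centers`.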

### Cluster Table

```r
# Cluster sizes: a confusion matrix is not applicable to K-Means,
# so we tabulate the cluster assignments instead
cm <- table(kmeans.re$cluster)
print("Clusters:")
print(cm)
```

**Output:**

```
[1] "Clusters:"

 1  2  3 
42 28 30 
```

The table above shows how the data points are assigned to the clusters: cluster 1 has 42 data points, cluster 2 has 28, and cluster 3 has 30.

### Model Visualization

```r
# Model Visualization
clusplot(df_1, kmeans.re$cluster, lines = 0, shade = TRUE, color = TRUE,
         labels = 2, plotchar = FALSE, span = TRUE, main = "Cluster data",
         xlab = 'Age', ylab = 'BMI')
```

**Output:**

We can observe that 3 clusters have been plotted and cluster 2 has minimal overlap with cluster 3.

```r
# Visualization using autoplot
autoplot(stats::kmeans(df_1, centers = 3), data = df_1)
```

**Output:**

Three clusters are shown in the above plot: the orange points belong to cluster 1, the green points to cluster 2, and the rest to cluster 3.

### Implementation of KNN in R

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm. It makes no assumptions about the underlying data or its distribution. It is one of the simplest and most widely used algorithms, and its behavior depends on the value of k (the number of neighbors). It can be used for both classification and regression, predicting a target variable from one or more independent variables. KNN stores all the available data and classifies a new data point based on its similarity to the stored points, so when new data appears it can be assigned to the best-suited category. The algorithm is also known as a “lazy learner” because it only stores the data in the training phase and defers all calculation until prediction. KNN is applied in healthcare, finance, and other sectors.
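The idea of classifying by similarity can be sketched in a few lines of base R. The helper `knn_predict_one` below is hypothetical and written only for illustration (the article itself relies on the `knn()` function from the class package); it assumes Euclidean distance and a simple majority vote, and uses the built-in `iris` data.

```r
# Classify one new point by majority vote among its k nearest training points
knn_predict_one <- function(train_x, train_y, new_point, k) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the new point to every stored training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  nearest <- order(dists)[seq_len(k)]   # indices of the k closest points
  votes <- table(train_y[nearest])      # count the labels among the neighbors
  names(votes)[which.max(votes)]        # most frequent label wins
}

train_x <- iris[, c("Petal.Length", "Petal.Width")]
train_y <- iris$Species

# A small flower with short, narrow petals
prediction <- knn_predict_one(train_x, train_y, new_point = c(1.5, 0.3), k = 5)
print(prediction)  # "setosa"
```

Note that nothing is computed until a prediction is requested, which is why KNN is called a lazy learner.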

### Install and Load Packages

The libraries we will be using for this KNN implementation are:

- **dplyr:** The dplyr package is a powerful and popular package for data manipulation and transformation in R. It provides a set of functions that allow efficient manipulation of data frames and tibbles.
- **ggplot2:** ggplot2 is an open-source data visualization package used in R to create graphics declaratively.
- **class:** The class package provides various functions for classification, including k-nearest neighbour, Learning Vector Quantization and Self-Organizing Maps.

```r
# Installing Packages
install.packages(c("dplyr", "ggplot2", "class"))

# Loading Packages
library(dplyr)
library(ggplot2)
library(class)   # provides knn()
```

### Load Dataset and explore

Here we are using the Breast Cancer Wisconsin (Diagnostic) dataset, which is freely available on the internet.

**Dataset Link:** Breast Cancer Wisconsin (Diagnostic)

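```r
# File path (adjust to your own location)
file_path <- file.path("C:/Users/subha/Downloads", "wdbc.csv")

# Reading the dataset. Note: the file has no header row, so read.csv()
# uses the first record as column names (see the output below)
wdbc <- read.csv(file_path)
head(wdbc, 2)
```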

**Output:**

```
   X842302 M X17.99 X10.38 X122.8 X1001 X0.1184 X0.2776 X0.3001 X0.1471 X0.2419 X0.07871
1   842517 M  20.57  17.77  132.9  1326 0.08474 0.07864  0.0869 0.07017  0.1812  0.05667
2 84300903 M  19.69  21.25  130.0  1203 0.10960 0.15990  0.1974 0.12790  0.2069  0.05999
  X1.095 X0.9053 X8.589 X153.4 X0.006399 X0.04904 X0.05373 X0.01587 X0.03003 X0.006193
1 0.5435  0.7339  3.398  74.08  0.005225  0.01308  0.01860  0.01340  0.01389  0.003532
2 0.7456  0.7869  4.585  94.03  0.006150  0.04006  0.03832  0.02058  0.02250  0.004571
  X25.38 X17.33 X184.6 X2019 X0.1622 X0.6656 X0.7119 X0.2654 X0.4601 X0.1189
1  24.99  23.41  158.8  1956  0.1238  0.1866  0.2416   0.186  0.2750 0.08902
2  23.57  25.53  152.5  1709  0.1444  0.4245  0.4504   0.243  0.3613 0.08758
```

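As loaded, the dataset consists of 568 rows and 32 columns. The column names (X842302, M, X17.99, ...) come from the first record of the file, which read.csv() treated as a header.

If you observe carefully, the second column has categorical values (B or M), and it is our target variable. The first column is only an ID and carries no predictive information, so let’s exclude it from the dataset.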

```r
# Remove the first column (ID)
wdbc <- wdbc[, -1]
```

### Sample the Dataset

The dataset contains 568 rows. As with the insurance data, we use sampling to learn how it can be applied to a large dataset, this time keeping only 200 rows. We again call dplyr’s sample_n() function, with size = 200 and replace = FALSE so that no row is drawn twice.

```r
set.seed(123)  # Set seed for reproducibility

# Data Sampling
wdbc_sampled <- wdbc %>%
  sample_n(size = 200, replace = FALSE)
```

### Data Normalization

Let’s check how the values vary in our dataset by summarizing columns 2 to 5.

```r
summary(wdbc_sampled[, 2:5])
```

**Output:**

```
     X17.99           X10.38          X122.8           X1001
 Min.   : 8.219   Min.   : 9.71   Min.   : 53.27   Min.   : 203.9
 1st Qu.:11.920   1st Qu.:16.33   1st Qu.: 76.39   1st Qu.: 432.4
 Median :13.340   Median :18.53   Median : 85.74   Median : 546.4
 Mean   :14.137   Mean   :19.14   Mean   : 91.89   Mean   : 653.1
 3rd Qu.:15.520   3rd Qu.:21.53   3rd Qu.:102.53   3rd Qu.: 748.2
 Max.   :28.110   Max.   :33.81   Max.   :188.50   Max.   :2499.0
```

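We can see that the value ranges differ widely between columns: the first column runs from about 8 to 28, while the fourth runs from about 204 to 2499. Because KNN relies on distances, columns with large ranges would dominate, so let’s normalize the data.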

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
data_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Normalize every column except the first (the categorical target)
wdbc_norm <- as.data.frame(lapply(wdbc_sampled[, -1], data_norm))
summary(wdbc_norm[, 2:5])
```

**Output:**

```
     X10.38           X122.8           X1001            X0.1184
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000
 1st Qu.:0.2746   1st Qu.:0.1709   1st Qu.:0.09954   1st Qu.:0.3272
 Median :0.3660   Median :0.2401   Median :0.14921   Median :0.4549
 Mean   :0.3913   Mean   :0.2856   Mean   :0.19572   Mean   :0.4648
 3rd Qu.:0.4905   3rd Qu.:0.3642   3rd Qu.:0.23715   3rd Qu.:0.5859
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000
```

We have normalized the whole dataset except the first column, which holds the categorical target and does not need to be normalized. As the summary shows, all values are now in the range 0 to 1.
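As a quick check of the min-max formula, here is a tiny worked example on a hand-made vector (the numbers are made up for illustration). The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.

```r
# Same min-max formula as in the article, redefined so this block runs on its own
data_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

values <- c(2, 4, 6, 10)
normalized <- data_norm(values)
print(normalized)  # 0.00 0.25 0.50 1.00
```

For instance, 4 becomes (4 - 2) / (10 - 2) = 0.25.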

### Split the dataset in train and test data

The sampled dataset is split into training and testing sets by randomly assigning each row to the training set with probability 0.7 or to the testing set with probability 0.3. The resulting training set (wdbc_train) therefore contains roughly 70% of the sampled data, and the testing set (wdbc_test) consists of the remaining rows. This kind of split is used in machine learning to assess model performance on unseen data.

```r
set.seed(1234)
# Assign each row to group 1 (train, ~70%) or group 2 (test, ~30%)
ind <- sample(2, nrow(wdbc_norm), replace = TRUE, prob = c(.7, .3))
wdbc_train <- wdbc_norm[ind == 1, ]
wdbc_test <- wdbc_norm[ind == 2, ]

# The class labels are in the first column of the un-normalized sample
wdbc_train_labels <- wdbc_sampled[ind == 1, 1]
wdbc_test_labels <- wdbc_sampled[ind == 2, 1]

# Note: wdbc_norm already excludes the label column, so 2:ncol() also drops
# its first feature; the outputs shown in this article use this selection
wdbc_train_features <- wdbc_train[, 2:ncol(wdbc_train)]
wdbc_test_features <- wdbc_test[, 2:ncol(wdbc_test)]
```
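Because the split above assigns rows probabilistically, the training share is only approximately 70%. If an exact share is wanted, one alternative is to sample row indices directly. The sketch below shows this on the built-in `iris` data; it is an illustrative variation, not the approach used for the results in this article.

```r
set.seed(1234)
n <- nrow(iris)

# Draw exactly 70% of the row indices for training, without replacement
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]   # all remaining rows

cat("Train rows:", nrow(train), " Test rows:", nrow(test), "\n")
```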

### KNN Model Training

Here we use the “class” library for the KNN model. We choose K with a rule of thumb: the training data has 145 rows, and the square root of 145 is approximately 12, so we take K = 12.

```r
k <- 12
wdbc_pred <- knn(train = wdbc_train_features,
                 test = wdbc_test_features,
                 cl = wdbc_train_labels,
                 k = k)
```

### Model Performance Analysis

```r
# Confusion matrix
confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
print(confusion_matrix)

# Accuracy: correctly classified instances (diagonal) over all instances
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
```

**Output:**

```
      Predicted
Actual  B  M
     B 32  1
     M  3 19
Accuracy: 0.9272727
```

To evaluate the model performance we have created a confusion matrix to check the correct classifications and misclassifications.

- Of the instances that are actually “B”, **32** were predicted as “B” and **1** was predicted as “M”. Of the instances that are actually “M”, **3** were predicted as “B” and **19** were predicted as “M”.
- This gives a model accuracy of **92.7%**, which is considered to be very good.
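The accuracy can be re-derived by hand from the four counts in the confusion matrix. The sketch below rebuilds the matrix from the numbers reported in the output and also computes precision and recall for the “M” class, two additional metrics that are not part of the original analysis but are often reported alongside accuracy.

```r
# Confusion matrix rebuilt from the counts in the output (filled column by column)
cm <- matrix(c(32, 3, 1, 19), nrow = 2,
             dimnames = list(Actual = c("B", "M"), Predicted = c("B", "M")))

# Accuracy: (32 + 19) / 55
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for M: correctly predicted M out of all actual M (19 / 22)
recall_M <- cm["M", "M"] / sum(cm["M", ])

# Precision for M: correctly predicted M out of all predicted M (19 / 20)
precision_M <- cm["M", "M"] / sum(cm[, "M"])

cat("Accuracy:", accuracy, " Recall (M):", recall_M,
    " Precision (M):", precision_M, "\n")
```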

### Determine optimal K value and visualize

Although we obtained a model accuracy of 92.7%, we should check whether another K value gives a more accurate result. Let’s evaluate K values from 1 to 20 to see which one performs best.

```r
# Range for K values
k_values <- 1:20

# To store accuracy values for different K values
accuracy_values <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  wdbc_pred <- knn(train = wdbc_train_features,
                   test = wdbc_test_features,
                   cl = wdbc_train_labels,
                   k = k_values[i])

  confusion_matrix <- table(Actual = wdbc_test_labels, Predicted = wdbc_pred)
  accuracy_values[i] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}

# Create a data frame for ggplot
accuracy_df <- data.frame(k = k_values, accuracy = accuracy_values)

# Plot accuracy for different k values using ggplot2
library(ggplot2)
ggplot(accuracy_df, aes(x = k, y = accuracy)) +
  geom_point(color = "blue", size = 3) +
  geom_line(color = "blue") +
  labs(title = "Accuracy for Different k Values",
       x = "k",
       y = "Accuracy") +
  theme_minimal()

# Identify the optimal k value (the first one with the highest accuracy)
optimal_k <- k_values[which.max(accuracy_values)]
cat("Optimal k value:", optimal_k, "\n")
```

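**Output:**

```
Optimal k value: 2
```

As we can see from the plot of accuracy against K, some other K values give better results than K = 12, and the optimal one is **2**. So, we can check the 2 nearest **neighbors** of a data point to predict which category it belongs to. Because this K was selected using the test set itself, the accuracy it achieves there is likely to be optimistic.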
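One way to choose K without touching the test set is leave-one-out cross-validation on the training data, which the class package offers through `knn.cv()`. The sketch below is an alternative to the loop above, not the method used for the results in this article; it runs on the built-in `iris` data as a stand-in.

```r
library(class)

set.seed(1234)
features <- scale(iris[, 1:4])   # standardized numeric features
labels <- iris$Species

# Leave-one-out cross-validated accuracy for K = 1..20
k_values <- 1:20
cv_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(train = features, cl = labels, k = k)
  mean(pred == labels)
})

best_k <- k_values[which.max(cv_accuracy)]
cat("Best k by leave-one-out CV:", best_k, "\n")
```

The test set can then be used once, at the end, to report the accuracy of the chosen K.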

### Conclusion

Applying data sampling before K-Nearest Neighbors (KNN) and K-Means clustering makes these machine learning algorithms much cheaper to run on a large dataset, because they only have to process a representative subset. When that subset is selected carefully, for example through random, stratified, or systematic sampling, it preserves the key characteristics of the full data, which limits bias and helps the results generalize. This also addresses the computational challenges of working with large datasets.