Classification on a large and noisy dataset with R

Last Updated: 17 Apr, 2024

In this article, we will discuss what noisy data is and how to perform classification on a large and noisy dataset with the R Programming Language.

What is noisy data?

Noise in data refers to random or irrelevant information that interferes with the analysis or interpretation of the data. It can include errors, inconsistencies, outliers, or irrelevant features that make it harder to extract meaningful insights or build accurate models.

Noise can come in several forms:

  1. Random Errors: Unpredictable mistakes during data collection, like typos or sensor malfunctions, causing inconsistencies or outliers.
  2. Systematic Errors: Consistent biases across data due to measurement flaws or calibration issues, distorting relationships between variables.
  3. Missing Values: Empty data points, if not handled properly, can skew analysis results.
  4. Outliers: Data points significantly different from the rest, often due to measurement errors or rare events, impacting statistical measures and model performance.
  5. Irrelevant Features: Features with no useful information for analysis, increasing data complexity and risking overfitting.
  6. Ambiguity or Inconsistency: Unclear or conflicting data, making interpretation and analysis difficult, stemming from collection method inconsistencies or coding errors.

Methods to identify noise in a dataset

Identifying noise in a dataset involves various techniques depending on the nature of the data and the specific types of noise present.

  1. Visual Inspection:
    • Plotting the data using scatter plots, histograms, or box plots can reveal outliers or patterns indicative of noise.
    • Visualizing relationships between variables can help identify inconsistencies or unexpected patterns.
  2. Statistical Methods:
    • Calculating summary statistics such as mean, median, standard deviation, and range can help identify outliers or extreme values.
    • Using measures like skewness or kurtosis can detect departures from expected data distributions, indicating potential noise.
    • Quantile-based methods, such as the interquartile range (IQR) or Z-score, can identify observations that fall outside normal ranges (see the sketch after this list).
  3. Machine Learning Models:
    • Train a model on the dataset and analyze the residuals (the differences between actual and predicted values). Large residuals may indicate noisy data points.
    • Models like isolation forests or one-class SVMs can be used for outlier detection.
  4. Domain Knowledge:
    • Understanding the context of the data and the domain it represents can help identify inconsistencies or errors.
  5. Clustering:
    • Clustering techniques can help identify groups of similar data points. Observations that do not fit well into any cluster may be considered noisy.
    • Density-based clustering algorithms like DBSCAN can automatically identify outliers as points in low-density regions.
  6. Data Quality Metrics:
    • Define and calculate data quality metrics specific to your dataset, such as completeness (presence of missing values), consistency (lack of contradictions), or accuracy (degree of error).
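
As a quick illustration of the quantile-based checks in item 2, here is a minimal base-R sketch on a hypothetical numeric vector x standing in for a single feature; the 1.5 × IQR and |z| > 3 cutoffs are conventional choices, not fixed rules.

R
# Hypothetical feature: mostly well-behaved values plus two planted extremes
set.seed(42)
x <- c(rnorm(100), 8, -9)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
iqr_outliers <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# Z-score rule: flag points more than 3 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
z_outliers <- abs(z) > 3

sum(iqr_outliers)  # number of IQR-flagged points
sum(z_outliers)    # number of Z-score-flagged points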

When dealing with a large, noisy dataset for classification in R, several techniques can handle both the scale of the data and the noise effectively.

  • Data Preprocessing (see the sketch after this list):
    • Handle missing values by imputation or removal.
    • Detect and decide on outliers.
    • Scale or normalize features.
    • Consider feature selection or dimensionality reduction.
  • Model Selection:
    • Choose appropriate algorithms like Random Forest, SVM, or Neural Networks.
    • Experiment with ensemble methods like bagging and boosting.
    • Tune hyperparameters using cross-validation.
  • Model Training:
    • Train models on subsets of data or use mini-batch gradient descent.
    • Utilize parallel processing for faster training.
  • Model Evaluation:
    • Evaluate using metrics like accuracy, precision, recall, and F1-score.
    • Use cross-validation for robust evaluation.
    • Pay attention to noise-robust metrics like F1-score or AUC.
  • Handling Noise:
    • Use noise-tolerant algorithms or robust optimization methods.
    • Employ ensemble methods like bagging.
    • Consider post-processing techniques like thresholding or filtering.
  • Model Deployment and Monitoring:
    • Deploy the model in production.
    • Monitor performance over time.
    • Gather feedback and retrain as needed.
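
To make the preprocessing bullets concrete, here is a minimal base-R sketch on a small made-up data frame df (the column names and values are illustrative only): median imputation for missing values, winsorizing to cap outliers, and standardization.

R
# Illustrative data frame with a missing value per column and one outlier
df <- data.frame(temp     = c(12.1, NA, 35.0, 11.8, 120),
                 humidity = c(0.80, 0.75, NA, 0.90, 0.60))

# 1. Impute missing values with the column median
for (col in names(df)) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}

# 2. Cap extreme values at the 1st and 99th percentiles (winsorizing)
caps <- quantile(df$temp, c(0.01, 0.99))
df$temp <- pmin(pmax(df$temp, caps[1]), caps[2])

# 3. Scale features to zero mean and unit variance
df_scaled <- as.data.frame(scale(df))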

Here we use a real dataset: the “Weather History” dataset.

Dataset Link: weatherHistory

  • Random Forest is applied here to a classification task using a dataset derived from weather observations.
  • This dataset likely contains various weather-related features such as temperature, humidity, wind speed, and visibility.
  • Classification involves predicting the ‘Summary’ of weather conditions based on these features, such as ‘Clear’, ‘Partly Cloudy’, or ‘Rainy’.
  • Characteristics:
    • Size: The dataset is large, containing 96,453 hourly observations.
    • Noise: Weather data can be prone to noise due to measurement errors, outliers, or inconsistent reporting.
    • Complexity: Weather patterns can exhibit complex relationships, making accurate prediction challenging.
  • Random Forest is chosen for its ability to handle large, noisy datasets and its robustness to overfitting.
  • By aggregating multiple decision trees trained on random subsets of data and features, Random Forest can effectively capture patterns in the data and make accurate predictions despite noise and complexity.
R
# Load necessary libraries
library(randomForest)

# Read the dataset
data <- read.csv("Your path/weatherHistory.csv")

# Explore the structure of the dataset
dim(data)
head(data)
str(data)

Output:

[1] 96453    12

                 Formatted.Date       Summary Precip.Type Temperature..C.
1 2006-04-01 00:00:00.000 +0200 Partly Cloudy        rain        9.472222
2 2006-04-01 01:00:00.000 +0200 Partly Cloudy        rain        9.355556
3 2006-04-01 02:00:00.000 +0200 Mostly Cloudy        rain        9.377778
4 2006-04-01 03:00:00.000 +0200 Partly Cloudy        rain        8.288889
5 2006-04-01 04:00:00.000 +0200 Mostly Cloudy        rain        8.755556
6 2006-04-01 05:00:00.000 +0200 Partly Cloudy        rain        9.222222
  Apparent.Temperature..C. Humidity Wind.Speed..km.h. Wind.Bearing..degrees.
1                 7.388889     0.89           14.1197                    251
2                 7.227778     0.86           14.2646                    259
3                 9.377778     0.89            3.9284                    204
4                 5.944444     0.83           14.1036                    269
5                 6.977778     0.83           11.0446                    259
6                 7.111111     0.85           13.9587                    258
  Visibility..km. Loud.Cover Pressure..millibars.                     Daily.Summary
1         15.8263          0              1015.13 Partly cloudy throughout the day.
2         15.8263          0              1015.63 Partly cloudy throughout the day.
3         14.9569          0              1015.94 Partly cloudy throughout the day.
4         15.8263          0              1016.41 Partly cloudy throughout the day.
5         15.8263          0              1016.51 Partly cloudy throughout the day.
6         14.9569          0              1016.66 Partly cloudy throughout the day.


'data.frame':    96453 obs. of  12 variables:
 $ Formatted.Date          : Factor w/ 96429 levels "2006-01-01 00:00:00.000 +0100",..: 2160 2161 2162 2163 2164
 $ Summary                 : Factor w/ 27 levels "Breezy","Breezy and Dry",..: 20 20 18 20 18 20 20 20 20 20 ...
 $ Precip.Type             : Factor w/ 3 levels "null","rain",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Temperature..C.         : num  9.47 9.36 9.38 8.29 8.76 ...
 $ Apparent.Temperature..C.: num  7.39 7.23 9.38 5.94 6.98 ...
 $ Humidity                : num  0.89 0.86 0.89 0.83 0.83 0.85 0.95 0.89 0.82 0.72 ...
 $ Wind.Speed..km.h.       : num  14.12 14.26 3.93 14.1 11.04 ...
 $ Wind.Bearing..degrees.  : num  251 259 204 269 259 258 259 260 259 279 ...
 $ Visibility..km.         : num  15.8 15.8 15 15.8 15.8 ...
 $ Loud.Cover              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Pressure..millibars.    : num  1015 1016 1016 1016 1017 ...
 $ Daily.Summary           : Factor w/ 214 levels "Breezy and foggy starting in the evening

First, load the randomForest library for modeling, then read the weather dataset from a CSV file. Explore the structure of the dataset using dim(), head(), and str().

Temperature distribution visualization

R
library(ggplot2)

# Create a histogram with adjusted colors
ggplot(data, aes(x = Temperature..C.)) +
  geom_histogram(bins = 30, fill = "red", color = "black", alpha = 0.7) +
  labs(x = "Temperature (°C)", y = "Count", title = "Temperature Distribution")

Output:

[Histogram: Temperature Distribution — count of observations per Temperature (°C) bin]

Data preprocessing on a large and noisy dataset

R
# Data preprocessing
data$Summary <- as.factor(data$Summary)
data <- data[, -c(1, 11, 12)]  # Remove 'Formatted.Date', 'Pressure..millibars.' and 'Daily.Summary'
data <- na.omit(data)  # Remove rows with missing values
sum(is.na(data))

Output:

[1] 0

Convert the ‘Summary’ column to a factor (categorical variable). Remove the ‘Formatted.Date’, ‘Pressure..millibars.’, and ‘Daily.Summary’ columns, which are not used as predictors. Handle missing values by removing rows with any missing data; sum(is.na(data)) confirms that none remain.

Split the dataset into training and testing sets

R
# Split the dataset into training and testing sets
set.seed(123)  # for reproducibility
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Train the random forest model
rf_model <- randomForest(Summary ~ ., data = train_data, ntree = 500)

summary(rf_model)

Output:

               Length  Class  Mode     
call                  4 -none- call     
type                  1 -none- character
predicted         77162 factor numeric  
err.rate          14000 -none- numeric  
confusion           756 -none- numeric  
votes           2083374 matrix numeric  
oob.times         77162 -none- numeric  
classes              27 -none- character
importance            8 -none- numeric  
importanceSD          0 -none- NULL     
localImportance       0 -none- NULL     
proximity             0 -none- NULL     
ntree                 1 -none- numeric  
mtry                  1 -none- numeric  
forest               14 -none- list     
y                 77162 factor numeric  
test                  0 -none- NULL     
inbag                 0 -none- NULL     
terms                 3 terms  call  

Split the dataset into training and testing sets (80% training, 20% testing). Train a Random Forest model with 500 trees on the training data (randomForest() function), specifying ‘Summary’ as the target variable and all other columns as predictors.
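
As an optional diagnostic (not part of the original walkthrough), the randomForest package can also report which predictors the forest relied on most:

R
# Mean decrease in Gini impurity for each predictor
importance(rf_model)

# The same information as a dot plot
varImpPlot(rf_model)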

Predict on the test set

R
# Predict on the test set
predictions <- predict(rf_model, test_data)

# Evaluate the model
confusion_matrix <- table(predictions, test_data$Summary)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")

Output:

Accuracy: 0.5496864 

Make predictions on the test set using the trained model (predict() function).

  • Evaluate the model’s performance:
    • Generate a confusion matrix comparing predicted vs. actual values.
    • Calculate accuracy as the ratio of correct predictions to total predictions.
  • The confusion matrix counts, for each of the 27 ‘Summary’ classes, how often the Random Forest model predicted that class against what actually occurred, summarizing where the model succeeds and where it confuses classes.
  • The accuracy is approximately 54.97%, meaning the model correctly classified about 55% of the instances in the test set; per-class precision, recall, and F1 can be derived from the same confusion matrix, as sketched below.
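
The sketch below derives those per-class metrics directly from the confusion matrix computed above; because table(predictions, test_data$Summary) puts predicted classes in rows and actual classes in columns, the diagonal holds each class's true positives.

R
# Per-class precision, recall, and F1 from the confusion matrix above
tp <- diag(confusion_matrix)
precision <- tp / rowSums(confusion_matrix)  # TP / (TP + FP)
recall    <- tp / colSums(confusion_matrix)  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

# Macro-averaged F1 (classes never predicted or never observed give NaN)
mean(f1, na.rm = TRUE)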

Conclusion

In short, classifying large and noisy datasets in R requires preprocessing to handle missing values and noise, selecting robust algorithms like Random Forest or SVMs, evaluating performance using metrics and visualizations, and ensuring model stability. These steps are crucial for accurate classification despite the challenges posed by noisy data.


