Open In App

Analysis of Car Sales Data in R

Car sales data can provide valuable insights into the automotive market, helping us understand the factors that influence prices, sales, and overall market trends. The dataset includes information on Price_in_thousands, Engine_size, Horsepower, Fuel_efficiency, and sales. Our goal is to analyze and visualize this data to uncover patterns, relationships, and trends in the R Programming Language.

Dataset Link: Car Sales

Load Packages and Data

# Install and load necessary packages
install.packages(c("tidyverse", "ggplot2", "plotly", "lubridate"))
library(tidyverse)
library(ggplot2)
library(plotly)
library(lubridate)

# Load the dataset
car_data <- read.csv("your\\path")

head(car_data)

Output:

  Price_in_thousands Engine_size Horsepower Fuel_efficiency  sales
1 21.50 1.8 140 28 16.919
2 28.40 3.2 225 25 39.384
3 NA 3.2 225 26 14.114
4 42.00 3.5 210 22 8.588
5 23.99 1.8 150 27 20.397
6 33.95 2.8 200 22 18.780

Check the Structure

# Check the structure of the dataset
str(car_data)

Output:

'data.frame':    157 obs. of  5 variables:
$ Price_in_thousands: num 21.5 28.4 NA 42 24 ...
$ Engine_size : num 1.8 3.2 3.2 3.5 1.8 2.8 4.2 2.5 2.8 2.8 ...
$ Horsepower : int 140 225 225 210 150 200 310 170 193 193 ...
$ Fuel_efficiency : int 28 25 26 22 27 22 21 26 24 25 ...
$ sales : num 16.92 39.38 14.11 8.59 20.4 ...

Check the Summary

# Display summary statistics
summary(car_data)

Output:

 Price_in_thousands  Engine_size      Horsepower    Fuel_efficiency
Min. : 9.235 Min. :1.000 Min. : 55.0 Min. :15.00
1st Qu.:18.017 1st Qu.:2.300 1st Qu.:149.5 1st Qu.:21.00
Median :22.799 Median :3.000 Median :177.5 Median :24.00
Mean :27.391 Mean :3.061 Mean :185.9 Mean :23.84
3rd Qu.:31.948 3rd Qu.:3.575 3rd Qu.:215.0 3rd Qu.:26.00
Max. :85.500 Max. :8.000 Max. :450.0 Max. :45.00
NA's :2 NA's :1 NA's :1 NA's :3
sales
Min. : 0.11
1st Qu.: 14.11
Median : 29.45
Mean : 53.00
3rd Qu.: 67.96
Max. :540.56

Check Null Value and Duplicate Rows

# Check for null values
colSums(is.na(car_data))

# Check for duplicate records
duplicate_rows <- car_data[duplicated(car_data), ]

# Display the duplicate rows
print(duplicate_rows)

Output:

Price_in_thousands        Engine_size         Horsepower 
2 1 1
Fuel_efficiency sales
3 0

[1] Price_in_thousands Engine_size Horsepower
[4] Fuel_efficiency sales
<0 rows> (or 0-length row.names)

Visualization of Car Sales Data in R

# Install and load necessary packages
# install.packages("ggplot2")
library(ggplot2)

# Load the dataset (replace with your actual dataset path)
car_data <- read.csv("your//path")

# Customize Scatterplot
custom_scatterplot <- ggplot(car_data, aes(x = Horsepower, y = Price_in_thousands, 
                                           color = Fuel_efficiency)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +  # Adjust color scale
  labs(
    title = "Customized Scatterplot",
    x = "Horsepower",
    y = "Price_in_thousands",
    color = "Fuel_efficiency"
  ) +
  theme_minimal()

# Show the scatterplot
print(custom_scatterplot)

Output:

gh

Analyzing Car Sales Data in R

Data Points Visualization: The scatterplot visually displays data points from a car dataset, where each point represents a car.

Histogram with Distplot for Analyzing Car Sales Data in R

# Assuming 'Price_in_thousands' is the column of interest in your 'car_data' dataframe
ggplot(car_data, aes(x = Price_in_thousands)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue", color = "black", bins = 30) +
  geom_density(alpha = 0.7, fill = "orange") +
  labs(title = "Histogram with Distribution Plot Overlay for Price_in_thousands",
       x = "Price_in_thousands",
       y = "Density")

Output:

gh

Analyzing Car Sales Data in R

This plot effectively visualizes the distribution of car prices in thousands using a combination of a histogram and a density plot overlay.

Outliers Detection for Analyzing Car Sales Data in R

# Set up a larger plotting area
par(mar = c(5, 5, 2, 2))

# Create a boxplot with enhanced features
boxplot(car_data$Price_in_thousands,
        main = "Distribution of Car Prices",
        col = "skyblue",
        border = "black",
        horizontal = TRUE,  # Display as a horizontal boxplot
        notch = TRUE,       # Add a notch for median confidence interval
        notchwidth = 0.5,   # Adjust the width of the notch
        outline = TRUE,     # Show individual data points
        cex.main = 1.2,     # Increase the title font size
        cex.axis = 1.1,     # Increase the axis label font size
        cex.lab = 1.1,      # Increase the axis tick label font size
        ylim = c(0, max(car_data$Price_in_thousands, na.rm = TRUE) * 1.1)  
)

# Add labels and grid
title(xlab = "Price in Thousands", cex.lab = 1.2)
title(ylab = "")
grid(lty = 2, col = "gray", lwd = 0.5)

Output:

gh

Analyzing Car Sales Data in R

Time Series Plot

# Load necessary packages and read the dataset
library(tidyverse)

# Specify the path to the dataset
dataset_path <- "your//path"

# Read the dataset
car_data <- read.csv(dataset_path)

# Create a time series object
ts_data <- ts(car_data$sales, frequency = 1)  

# Time Series Plot
plot(ts_data, main = "Time Series Plot of Sales", xlab = "Date", ylab = "Sales", 
     col = "blue", type = "l")

# Apply a Simple Moving Average (SMA) to smooth the series
sma_window <- 12  # Set the window size for the moving average
sma <- stats::filter(ts_data, rep(1/sma_window, sma_window), sides = 2)

# Overlay the moving average on the plot
lines(sma, col = "red", lwd = 2)

# Add legend
legend("topright", legend = c("Original Sales", paste("SMA (", sma_window, ")", 
                                         sep = "")), col = c("blue", "red"), lty = 1)

Output:

gh

Analyzing Car Sales Data in R

Dataset Loading: The code reads a dataset from a specified path using the read.csv function and assigns it to the car_data variable.

Create a Model for Car Sales Data

# Load necessary packages
library(tidyverse)
library(randomForest)

# Load the car dataset
car_data <- read.csv("your//path")

# Handle missing values if needed
car_data <- na.omit(car_data)

# Split the dataset into training and testing sets
set.seed(123)
train_indices <- sample(1:nrow(car_data), 0.8 * nrow(car_data))
train_data <- car_data[train_indices, ]
test_data <- car_data[-train_indices, ]

# Train a Random Forest regression model with hyperparameter tuning
rf_model <- randomForest(
  sales ~ ., 
  data = train_data,
  ntree = 500,    # Adjust as needed
  mtry = 3,       # Adjust as needed
  nodesize = 5    # Adjust as needed
)

# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the regression model
rmse <- sqrt(mean((predictions - test_data$sales)^2))
print(paste("Root Mean Squared Error (RMSE):", round(rmse, 4)))

# Mapping Accuracy
mapping_accuracy <- 1 - (rmse / sd(test_data$sales))
print(paste("Mapping Accuracy:", round(mapping_accuracy, 4)))

Output:

[1] "Root Mean Squared Error (RMSE): 99.9132"

[1] "Mapping Accuracy: 0.0485"

Data Splitting: The code divides the car dataset into two parts for training (80%) and testing (20%).

Make Predictions for Car Sales Data

# Now, if you want to predict future values, you can use the trained model
# Let's create a new observation for prediction
new_observation <- data.frame(
  Price_in_thousands = c(25),
  Engine_size = c(2),
  Horsepower = c(150),
  Fuel_efficiency = c(25)
)

# Make predictions on the new observation
future_prediction <- predict(rf_model, newdata = new_observation)

# Display the predicted class
print(future_prediction)

Output:

       1 
17.64733

Predicting Future Values: Now, if you want to predict future car sales using the trained model, you can do that.

Conclusion

Our analysis of the car sales dataset has provided valuable insights into pricing, sales, and market trends. We explored key variables, visualized patterns, addressed outliers, and even implemented a classification model for predictions. Here we use various analytical techniques, we gained a deeper understanding of the automotive market dynamics.

Article Tags :