Diabetes Prediction using R

In this article, we will learn how to predict whether a person has diabetes using the Diabetes dataset. Because this is a binary classification problem, we use logistic regression in the R programming language.

Here are the main steps for this project:

  1. Load the dataset
  2. Analyze the data
  3. Exploratory data analysis (EDA)
  4. Preprocessing
  5. Training model
  6. Evaluate model
  7. Make prediction

Overview of dataset

The dataset used is "diabetes.csv", which contains diagnostic measurements for each patient: number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI (Body Mass Index), diabetes pedigree function, and age. The target variable, labeled "Outcome", records the presence or absence of diabetes.

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration two hours into an oral glucose tolerance test
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skinfold thickness (mm)
  5. Insulin: 2-Hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)^2)
  7. DiabetesPedigreeFunction: Based on family history, it is a numerical score that determines the genetic risk of diabetes. It considers the prevalence of diabetes among relatives to assess an individual's likelihood of developing the condition. A higher DPF score indicates a greater genetic predisposition to diabetes.
  8. Age: Age (years)
  9. Outcome: Class variable (0 or 1)

DataSet Link: Diabetes Dataset

Load Dataset and required libraries

The code loads four R libraries: "readr" for reading data files, "caTools" for partitioning data into training and testing sets, "caret" for the confusion matrix and related evaluation statistics, and "e1071", which caret may require for some of its functions. The logistic regression itself is fitted with glm(), which is part of base R, and the file is read with base R's read.csv(), so "readr" is optional here.

library(readr)
library(caTools)
library(caret)
library(e1071)

data <- read.csv("diabetes.csv")
head(data)

Output:

  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age Outcome
1           6     148            72            35       0 33.6                    0.627  50       1
2           1      85            66            29       0 26.6                    0.351  31       0
3           8     183            64             0       0 23.3                    0.672  32       1
4           1      89            66            23      94 28.1                    0.167  21       0
5           0     137            40            35     168 43.1                    2.288  33       1
6           5     116            74             0       0 25.6                    0.201  30       0

The head command prints the first six rows of the data to the console.

This dataset contains eight features and one binary target, Outcome. An Outcome of 0 means the patient does not have diabetes, whereas 1 means they do.

Analyze the data

To view summary statistics for each column (minimum, quartiles, median, mean, and maximum), we use the summary command. We then count the missing values in each column so that any incomplete rows can be dropped.

# Check data summary
summary(data)

# Check for missing values
colSums(is.na(data))

Output:

  Pregnancies        Glucose      BloodPressure    SkinThickness      Insulin           BMI
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.00
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0   1st Qu.:27.30
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5   Median :32.00
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8   Mean   :31.99
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2   3rd Qu.:36.60
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0   Max.   :67.10
 DiabetesPedigreeFunction      Age           Outcome
 Min.   :0.0780           Min.   :21.00   Min.   :0.000
 1st Qu.:0.2437           1st Qu.:24.00   1st Qu.:0.000
 Median :0.3725           Median :29.00   Median :0.000
 Mean   :0.4719           Mean   :33.24   Mean   :0.349
 3rd Qu.:0.6262           3rd Qu.:41.00   3rd Qu.:1.000
 Max.   :2.4200           Max.   :81.00   Max.   :1.000

Pregnancies Glucose BloodPressure SkinThickness
0 0 0 0
Insulin BMI DiabetesPedigreeFunction Age
0 0 0 0
Outcome
0

The dataset contains no NA values, so no rows need to be dropped. In addition, every attribute is numerical, so no categorical encoding is required.
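One caveat is that is.na() only detects values coded as NA. The summary output shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is not physiologically plausible and likely marks measurements that were never recorded. The following is a minimal sketch of how such zeros could be counted, using the six rows printed by head(data) as stand-in data. The names sample_rows and zero_as_missing are illustrative and not part of the original code.

```r
# Stand-in data: the six rows printed by head(data)
sample_rows <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Count zeros per column; is.na() would report 0 for every one of these columns
zero_counts <- colSums(sample_rows == 0)
print(zero_counts)

# Optionally recode zeros as NA so that later steps treat them as missing
zero_as_missing <- sample_rows
zero_as_missing[zero_as_missing == 0] <- NA
print(colSums(is.na(zero_as_missing)))
```

Whether to impute or drop such values is a modeling decision; the rest of this article keeps the data as loaded.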

Moving on, let's use exploratory data analysis (EDA) to find connections between the attributes.

Data Visualization

Now we will create some visualizations to get a better understanding of this dataset.

Correlation Heatmap

# import libraries
library(ggplot2)
library(reshape2)

correlation_matrix <- cor(data)

# Convert correlation matrix to long format
correlation_melted <- melt(correlation_matrix)

# Plot heatmap
ggplot(correlation_melted, aes(Var1, Var2, fill=value)) +
  geom_tile(color="white") +
  scale_fill_gradient2(low="blue", high="red", mid="white", midpoint=0, 
                       limit=c(-1,1), space="Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Correlation Heatmap", x="Features", y="Features")

Output:

[Figure: "Correlation Heatmap" plot. Caption: Predicting Diabetes Risk in R]

Distribution of diabetes outcome

outcome_counts <- table(data$Outcome)
outcome_df <- data.frame(Outcome = names(outcome_counts), 
                         Count = as.numeric(outcome_counts))

# Create bar plot
ggplot(outcome_df, aes(x=Outcome, y=Count)) +
  geom_bar(stat="identity", fill="pink") +
  labs(title="Distribution of Diabetes Outcomes", x="Outcome", y="Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=12),
        plot.title = element_text(size=16, face="bold"))

Output:

[Figure: "Distribution of Diabetes Outcomes" bar plot]

Histograms with Outcome Split

library(ggplot2)

# Select relevant columns
diabetes_subset <- data[, c("Pregnancies", "Glucose", "BloodPressure", 
                                "BMI", "Age", "Outcome")]

# Histograms
ggplot(diabetes_subset, aes(x = Pregnancies, fill = factor(Outcome))) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.7) +
  labs(title = "Distribution of Pregnancies by Outcome") +
  facet_wrap(~Outcome, scales = "free_y") +
  theme_minimal()

Output:

[Figure: "Distribution of Pregnancies by Outcome" histograms]


Here we compare the distribution of a numerical feature (Pregnancies) for positive and negative outcomes.

Boxplot for BMI by Outcome

library(ggplot2)

# Boxplot
ggplot(data, aes(x = factor(Outcome), y = BMI, fill = factor(Outcome))) +
  geom_boxplot() +
  labs(title = "BMI Distribution by Outcome") +
  theme_minimal()

Output:

[Figure: "BMI Distribution by Outcome" boxplot]

The boxplot shows the distribution of BMI for each outcome.

Data Preprocessing

  1. Split data into X and y.
  2. Scale the features (X).
  3. Bind scaled X with y.
  4. Split the scaled data into X and y.
  5. Initialize the random number generator with 123 as the seed value.
  6. Divide X and y into two parts: 70% for training and 30% for testing.

# Split data into X and y
X <- data[, 1:8]
y <- data[, 9]

# Scale only the features (X)
scaled_X <- as.data.frame(scale(X))

# Bind scaled X with y
scaled_data <- cbind(scaled_X, y)


# Split scaled data into X and y
X <- scaled_data[, 1:8]
y <- scaled_data[, 9]

# Split X and y into training and testing sets
set.seed(123)
sample <- sample.split(y, SplitRatio = 0.7)
X_train <- X[sample == TRUE, ]
y_train <- y[sample == TRUE]
X_test <- X[sample == FALSE, ]
y_test <- y[sample == FALSE]

Split data into X and y: This code divides the dataset into two pieces, X (features) and y (target variable). X comprises the dataset's first eight columns, the features; y contains the ninth column, the target variable (Outcome). The features are then standardized with scale(), and sample.split() assigns 70% of the rows to the training set and the remaining 30% to the test set.
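To illustrate what scale() does, the sketch below standardizes two columns by hand and compares the result with scale(). It uses the Glucose and BMI values from the first six rows as stand-in data; the names toy, centers, sds, and new_row are illustrative, and the new_row values are hypothetical.

```r
# Stand-in data: Glucose and BMI from the six rows printed by head(data)
toy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# scale() returns a matrix carrying "scaled:center" and "scaled:scale" attributes
scaled <- scale(toy)

# The same standardization computed by hand for one column
manual_glucose <- (toy$Glucose - mean(toy$Glucose)) / sd(toy$Glucose)
print(all.equal(as.numeric(scaled[, "Glucose"]), manual_glucose))

# Keep the statistics so that new observations can be put on the same scale
centers <- attr(scaled, "scaled:center")
sds     <- attr(scaled, "scaled:scale")

new_row <- data.frame(Glucose = 120, BMI = 30)  # hypothetical patient
new_scaled <- scale(new_row, center = centers, scale = sds)
print(new_scaled)
```

A model trained on standardized features expects new inputs on the same scale, so raw measurements should go through the same transformation before they are passed to predict().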

Model Training and Evaluation

We use logistic regression for the classification task in this project. The code below trains the model, predicts on the test set, and evaluates the predictions with a confusion matrix.

# Fit a logistic regression model on the training data
log_model <- glm(y_train ~ ., data = X_train, family = binomial)

# Make predictions
predictions <- predict(log_model, newdata = X_test, type = "response")

# Convert predicted outcome to factor with levels matching actual outcome
predictions <- factor(ifelse(predictions > 0.5, 1, 0), 
                      levels = levels(as.factor(y_test)))

# Generate confusion matrix
confusionMatrix(predictions, as.factor(y_test))

Output:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 127  37
         1  23  43

Accuracy : 0.7391
95% CI : (0.6773, 0.7946)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 0.002949

Kappa : 0.4005

Mcnemar's Test P-Value : 0.093290

Sensitivity : 0.8467
Specificity : 0.5375
Pos Pred Value : 0.7744
Neg Pred Value : 0.6515
Prevalence : 0.6522
Detection Rate : 0.5522
Detection Prevalence : 0.7130
Balanced Accuracy : 0.6921

'Positive' Class : 0

This code uses the training data (X_train and y_train) to train a logistic regression model (log_model). The formula y_train ~ . models the target variable as a function of all variables in the training data (X_train). The family = binomial argument specifies a binomially distributed response, which is what makes glm fit a logistic regression for binary classification. Given the input feature values, the model estimates the probability that the target variable equals 1.
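To make the link between the fitted coefficients and the estimated probability concrete, here is a small sketch on R's built-in mtcars data (a stand-in, not the diabetes data). It fits a binomial glm and checks that predict(..., type = "response") equals the logistic function applied to the linear predictor. The variable names are illustrative.

```r
# Stand-in example: model transmission type (am, coded 0/1) from car weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Probabilities as returned by predict()
p_predict <- predict(fit, newdata = mtcars, type = "response")

# The same probabilities by hand: logistic function of the linear predictor
design <- model.matrix(~ wt, data = mtcars)          # intercept column plus wt
linear_predictor <- as.numeric(design %*% coef(fit))
p_manual <- plogis(linear_predictor)                 # 1 / (1 + exp(-x))

print(all.equal(as.numeric(p_predict), p_manual))
```

The same relationship holds for log_model: each prediction is a probability between 0 and 1 that still has to be compared with a threshold to obtain a class label.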

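The second part of the training code applies the trained logistic regression model (log_model) to the test data (X_test). Predicted probabilities above 0.5 are labeled 1 and the rest 0, and confusionMatrix compares these labels with the actual outcomes. Note that the output lists 0 as the 'Positive' class, so the sensitivity of 0.8467 describes how well non-diabetic patients are identified, while the specificity of 0.5375 is the share of diabetic patients that the model detects.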
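The headline statistics in the confusion-matrix output can be reproduced by hand from its four counts. The sketch below copies those counts from the output above; the object name cm is illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(127, 23, 37, 43), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Share of all test rows that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Class 0 is the "positive" class in the output, so sensitivity is the share of
# actual 0s predicted as 0 and specificity is the share of actual 1s predicted as 1
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

Passing positive = "1" to confusionMatrix() would report these statistics from the perspective of the diabetic class instead.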

Make a prediction

predict_diabetes <- function(pregnancies, glucose, bloodpressure, skinthickness, 
                             insulin, bmi, diabetespedigreefunction, age) {
  input_data <- data.frame(
    Pregnancies = pregnancies,
    Glucose = glucose,
    BloodPressure = bloodpressure,
    SkinThickness = skinthickness,
    Insulin = insulin,
    BMI = bmi,
    DiabetesPedigreeFunction = diabetespedigreefunction,
    Age = age
  )
  # The model was trained on standardized features, so apply the same
  # centering and scaling (means and SDs of the original feature columns)
  features <- data[, 1:8]
  input <- as.data.frame(scale(input_data,
                               center = colMeans(features),
                               scale = apply(features, 2, sd)))
  prediction <- predict(log_model, newdata = input, type = "response")
  # Convert the predicted probability into a class label (0 or 1)
  prediction <- factor(ifelse(prediction > 0.5, 1, 0), levels = c(0, 1))
  
  return(prediction)
}

new_patient <- data.frame(
  pregnancies = 6,
  glucose = 148,
  bloodpressure = 72,
  skinthickness = 35,
  insulin = 0,
  bmi = 33.6,
  diabetespedigreefunction = 0.627,
  age = 50
)

prediction <- predict_diabetes(
  new_patient$pregnancies,
  new_patient$glucose,
  new_patient$bloodpressure,
  new_patient$skinthickness,
  new_patient$insulin,
  new_patient$bmi,
  new_patient$diabetespedigreefunction,
  new_patient$age
)

if (any(prediction == 1)) {
  cat("Based on the model's prediction, there is a higher chance of diabetes.")
} else {
  cat("Based on the model's prediction, the risk of diabetes appears lower.")
}

Output:

Based on the model's prediction, there is a higher chance of diabetes.

This code defines the function predict_diabetes, which predicts whether a new patient has diabetes based on their medical characteristics. The function accepts the number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age. It builds a data frame from these inputs and applies the trained logistic regression model (log_model) to estimate the new patient's diabetes risk.
