Sklearn Diabetes Dataset : Scikit-learn Toy Datasets in Python

Last Updated : 29 Apr, 2024

The Sklearn Diabetes Dataset is one of the toy datasets bundled with the scikit-learn machine learning library. It contains real clinical measurements from 442 diabetes patients rather than synthetic data, and it originates from the study by Efron et al. (2004), "Least Angle Regression". Because it is small and ships with the library, it is often used for demonstration purposes in machine learning tutorials and examples. In this article, we are going to learn more about the Sklearn Diabetes Dataset, how to load the dataset, and its application in machine learning.

What is a Diabetes Dataset?

A diabetes dataset is a collection of patient records that researchers analyze with statistical or machine learning methods to uncover patterns in diabetes. The Sklearn Diabetes Dataset is a small, clean example of such data, which makes it a convenient starting point for applying machine learning algorithms to healthcare analytics.

What is the Sklearn Diabetes Dataset?

The scikit-learn Diabetes Dataset or Sklearn Diabetes dataset consists of ten baseline variables, such as age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements, obtained for 442 diabetes patients. The target variable is a quantitative measure of disease progression one year after baseline.


Characteristics of Sklearn Diabetes Dataset

Number of Instances: 442
Number of Attributes: 10 numeric predictive features (the first 10 columns).
Target: Column 11 represents a quantitative measure of disease progression one year after baseline.

Attribute Information of Sklearn Diabetes Dataset

The Sklearn Diabetes Dataset includes the following attributes:

age: Age in years
sex: Sex of the patient
bmi: Body mass index
bp: Average blood pressure
s1: Total serum cholesterol (tc)
s2: Low-density lipoproteins (ldl)
s3: High-density lipoproteins (hdl)
s4: Total cholesterol / HDL (tch)
s5: Possibly log of serum triglycerides level (ltg)
s6: Blood sugar level (glu)
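
As a quick check of the attribute list above, the following minimal sketch prints the feature names stored in the loaded dataset and pairs the six serum columns with their abbreviations. The serum_abbreviations dictionary is a hypothetical helper built from the list above; it is not part of scikit-learn.

```python
from sklearn.datasets import load_diabetes

# Load the dataset as a Bunch object
diabetes = load_diabetes()

# Column names, in the same order as the columns of diabetes.data
feature_names = list(diabetes.feature_names)
print(feature_names)

# Hypothetical lookup table (not part of scikit-learn), copied from the list above
serum_abbreviations = {
    "s1": "tc", "s2": "ldl", "s3": "hdl",
    "s4": "tch", "s5": "ltg", "s6": "glu",
}
for column in feature_names:
    print(column, "->", serum_abbreviations.get(column, column))
```

The full description that ships with the dataset is also available as text in diabetes.DESCR.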

How to Load Sklearn Diabetes Dataset?

The sklearn.datasets.load_diabetes function is used to load the Diabetes Dataset available in scikit-learn.

Syntax: sklearn.datasets.load_diabetes(*, return_X_y=False, as_frame=False, scaled=True)

Here’s what each parameter does:

return_X_y: If set to True, the function returns a (data, target) tuple, so the features (X) and the target values (y) arrive as separate arrays. If False (default), it returns a Bunch object containing both data and target.
as_frame: If set to True, the features are returned as a pandas DataFrame and the target as a pandas Series, and the Bunch object also carries a frame attribute that combines both. If False (default), the features and target are numpy arrays.
scaled: If set to True (default), each feature is mean-centered and scaled by the standard deviation times the square root of the number of samples, so the sum of squares of each column equals 1. This is not the same as unit variance. If False, the features are returned in their original units.

Example:

from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes_sklearn = load_diabetes()

How to load Diabetes Dataset using Sklearn

Using the following code, we will load the Sklearn Diabetes Dataset and print the shape of the dataset.

Python3

import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes_sklearn = load_diabetes()

# Convert the dataset to a DataFrame
diabetes_df = pd.DataFrame(data=diabetes_sklearn.data,
                           columns=diabetes_sklearn.feature_names)

# Add target variable to the DataFrame
diabetes_df['target'] = diabetes_sklearn.target

print(diabetes_df.head())

# Print the shape of the combined DataFrame (10 features + target)
print("Shape of Sklearn Diabetes Data:", diabetes_df.shape)

Output:

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  target  
0 -0.002592  0.019907 -0.017646   151.0  
1 -0.039493 -0.068332 -0.092204    75.0  
2 -0.002592  0.002861 -0.025930   141.0  
3  0.034309  0.022688 -0.009362   206.0  
4 -0.002592 -0.031988 -0.046641   135.0 
 
Shape of Sklearn Diabetes Data: (442, 11)

The output confirms that the dataset contains information on 442 individuals, with 10 features per individual and one target column.
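
The same table can also be obtained without building the DataFrame by hand. The sketch below assumes pandas is installed and uses as_frame=True, in which case the returned Bunch exposes a frame attribute holding the features together with the target column.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects; .frame holds features plus target
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.frame

print(diabetes_df.shape)
print(diabetes_df.columns.tolist())

# Summary statistics of the disease progression measure
print(diabetes_df["target"].describe())
```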

Applications of Sklearn Diabetes Dataset

The diabetes dataset from scikit-learn is commonly used in machine learning tutorials to practice regression techniques. Some of its uses are:

Regression Modeling: Regression modeling involves using algorithms such as linear regression, ridge regression, Lasso regression, and decision tree regression to predict disease progression based on patient characteristics.
Feature Selection and Engineering: Experimenting with feature selection techniques to find relevant features for prediction.
Model Evaluation and Comparison: Evaluating and comparing models involves dividing the dataset into training and testing sets, training various regression models, and assessing their performance using metrics such as MSE, R-squared, and MAE.
Predictive Analytics in Healthcare: Although the scikit-learn diabetes dataset is small and its features are anonymized and rescaled, it is derived from real patient measurements, so regression models built on it give a realistic first look at predictive analytics in healthcare. Healthcare professionals can apply the same methods to larger clinical datasets to forecast the advancement of diseases, evaluate the risk level of patients, and tailor treatment strategies for individuals with diabetes.
Teaching and Learning: Commonly utilized in academic environments to teach regression principles, feature selection, model assessment, and various machine learning activities.
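
The regression, evaluation, and feature selection uses listed above can be sketched in a few lines. This is only an illustration: the 80/20 split, the random_state, the alpha values, and the tree depth are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Hold out 20% of the patients for testing (arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate regressors; hyperparameters are illustrative only
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, predictions),
        "mae": mean_absolute_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }
    print(name, results[name])

# Lasso can shrink coefficients exactly to zero, a simple form of feature selection
dropped = [
    feature
    for feature, coef in zip(diabetes.feature_names, models["lasso"].coef_)
    if coef == 0.0
]
print("Features dropped by Lasso:", dropped)
```

Comparing the metrics across models on the same held-out set is what makes the comparison fair; with only 442 rows, the numbers can shift noticeably with a different split, so cross-validation is a reasonable next step.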

Conclusion

Overall, the Sklearn Diabetes Dataset provided by scikit-learn facilitates research and analysis in the field of diabetes management and predictive healthcare. By leveraging machine learning techniques and exploring the relationships between baseline variables and disease progression, researchers can develop valuable insights and predictive models that contribute to the advancement of diabetes treatment and patient care.

FAQ on Sklearn Diabetes Datasets

What is the Sklearn Diabetes Dataset in scikit-learn?

The Sklearn Diabetes Dataset in scikit-learn is a dataset containing ten baseline variables and a quantitative measure of disease progression in 442 diabetes patients.

What are the features included in the Sklearn Diabetes Dataset?

The dataset includes ten baseline variables: age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements.

What is the target variable in the Sklearn Diabetes Dataset?

The target variable represents a quantitative measure of disease progression one year after baseline for each patient.

How many instances are there in the Sklearn Diabetes Dataset?

The dataset consists of 442 instances, each representing a different diabetes patient.

How can I access the Sklearn Diabetes Dataset using scikit-learn?

You can load the dataset using the load_diabetes function from the sklearn.datasets module.

Can I obtain the features and target labels separately?

Yes, by setting the parameter return_X_y to True, you can obtain the features (X) and the target values (y) as separate arrays.

Can I load the dataset as a pandas DataFrame?

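Yes, by setting the parameter as_frame to True, the features are returned as a pandas DataFrame and the target as a pandas Series. The returned object also has a frame attribute that holds the features and the target in a single DataFrame.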
