
Breast Cancer Wisconsin (Diagnostic) Dataset

The Breast Cancer Wisconsin (Diagnostic) dataset is a widely used benchmark in machine learning and medical research. Its features are computed from digitized images of fine needle aspirates (FNA) of breast masses and describe the cell nuclei visible in each image, which supports models that classify a mass as malignant or benign. In this article, we cover the attributes, statistics, and significance of this dataset.

Understanding Breast Cancer Wisconsin (Diagnostic) Dataset

The dataset was created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian. Each record holds features computed from a digitized image of a fine needle aspirate (FNA) sample of a breast mass, together with the diagnosis for that mass.

Characteristics of Breast Cancer Wisconsin (Diagnostic) Dataset

  1. Number of Instances: 569
  2. Number of Attributes: 30 numerical attributes used for prediction, along with a class label.
  3. Class Distribution: 212 - Malignant, 357 - Benign
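
These figures can be checked against the copy of the dataset that ships with scikit-learn. The snippet below is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# data.data is the feature matrix; data.target holds the encoded class labels
n_samples, n_features = data.data.shape

# Count the samples in each encoded class and pair the counts with class names
counts = np.bincount(data.target)
class_counts = {
    str(name): int(count) for name, count in zip(data.target_names, counts)
}

print("Instances:", n_samples)
print("Attributes:", n_features)
print("Class distribution:", class_counts)
```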

Attributes of Breast Cancer Wisconsin (Diagnostic) Dataset

The dataset comprises 30 features. Ten measurements of the cell nuclei are taken for each image, and each measurement is summarized by its mean, its standard error, and its "worst" value (the mean of the three largest values). The ten measurements are listed below under their mean-valued feature names:

  1. mean radius: Mean of the distances from the center to points on the perimeter.
  2. mean texture: Standard deviation of gray-scale values.
  3. mean perimeter: Perimeter of the cell nucleus.
  4. mean area: Area of the cell nucleus.
  5. mean smoothness: Local variation in radius lengths.
  6. mean compactness: Perimeter^2 / Area - 1.0.
  7. mean concavity: Severity of concave portions of the contour.
  8. mean concave points: Number of concave portions of the contour.
  9. mean symmetry: Symmetry of the cell nucleus.
  10. mean fractal dimension: "Coastline approximation" - 1.
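
To see how the ten measurements expand into 30 columns, the feature names can be grouped by statistic. This is a minimal sketch that relies on the naming pattern used in scikit-learn's copy of the dataset ("mean radius", "radius error", "worst radius"); the `groups` dictionary is an illustrative name, not part of the library.

```python
from sklearn.datasets import load_breast_cancer

feature_names = [str(name) for name in load_breast_cancer().feature_names]

# scikit-learn names the three statistics "mean <x>", "<x> error", "worst <x>"
groups = {
    "mean": [n for n in feature_names if n.startswith("mean ")],
    "error": [n for n in feature_names if n.endswith(" error")],
    "worst": [n for n in feature_names if n.startswith("worst ")],
}

for label, names in groups.items():
    print(f"{label}: {len(names)} features, e.g. {names[:3]}")
```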

Summary of the dataset:

  • Classes: 2
  • Samples per class: 212 (M), 357 (B)
  • Samples total: 569
  • Dimensionality: 30
  • Features: real, positive

How to Load the Breast Cancer Wisconsin (Diagnostic) Dataset?

The sklearn.datasets.load_breast_cancer function is used to load the Breast Cancer Wisconsin dataset.

Syntax: sklearn.datasets.load_breast_cancer(*, return_X_y=False, as_frame=False)

Here's what each parameter does:

  • return_X_y:
    • When set to True: The function provides the features (X) and the target labels (y) as distinct arrays.
    • When set to False (default): The function returns a Bunch object containing both the data and target labels together.
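  • as_frame:
    • When set to True: The data is returned as a pandas DataFrame and the target as a pandas Series. The Bunch object also carries a frame attribute that combines the features and the target.
    • When set to False (default): The data and target are returned as NumPy arrays, either inside the Bunch object or as an (X, y) tuple, depending on the value of return_X_y.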
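
The effect of the two parameters is easiest to see side by side. The snippet below is a minimal sketch, assuming scikit-learn 0.23 or later (where as_frame is available) and pandas are installed; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Default: a Bunch (dictionary-like object) holding NumPy arrays
bunch = load_breast_cancer()
print(type(bunch.data).__name__, bunch.data.shape)

# return_X_y=True: a plain (X, y) tuple of NumPy arrays
X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)

# as_frame=True: pandas objects; .frame combines the features and the target
frame_bunch = load_breast_cancer(as_frame=True)
df = frame_bunch.frame
print(df.shape)

# Both flags together: X is a DataFrame and y is a Series
X_df, y_series = load_breast_cancer(return_X_y=True, as_frame=True)
print(type(X_df).__name__, type(y_series).__name__)
```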

Loading Breast Cancer Dataset using Sklearn

We load the breast cancer dataset from sklearn, convert it into a pandas DataFrame, and display the first few rows.

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset from sklearn
data = load_breast_cancer()

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target variable to the DataFrame
df['target'] = data.target

# Display the first five rows of the DataFrame
print(df.head())

Output:

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0                 0.07871  ...          17.33           184.60      2019.0
1                 0.05667  ...          23.41           158.80      1956.0
2                 0.05999  ...          25.53           152.50      1709.0
3                 0.09744  ...          26.50            98.87       567.7
4                 0.05883  ...          16.67           152.20      1575.0

   worst smoothness  worst compactness  worst concavity  worst concave points  \
0            0.1622             0.6656           0.7119                0.2654
1            0.1238             0.1866           0.2416                0.1860
2            0.1444             0.4245           0.4504                0.2430
3            0.2098             0.8663           0.6869                0.2575
4            0.1374             0.2050           0.4000                0.1625

   worst symmetry  worst fractal dimension  target
0          0.4601                  0.11890       0
1          0.2750                  0.08902       0
2          0.3613                  0.08758       0
3          0.6638                  0.17300       0
4          0.2364                  0.07678       0

[5 rows x 31 columns]

Significance of Sklearn Breast Cancer Wisconsin (Diagnostic) Dataset in Machine Learning

The dataset's significance lies in its utility for breast cancer diagnosis. By analyzing features extracted from FNA images, medical practitioners and researchers can develop models for automated or assisted diagnosis of breast cancer. Features such as texture, smoothness, and concavity help distinguish malignant from benign masses.

  1. Binary Classification: The primary application of this dataset is binary classification, where machine learning models are trained to predict whether a breast tumor is malignant (cancerous) or benign (non-cancerous) based on features extracted from digitized images of fine needle aspirate (FNA) samples. Algorithms such as logistic regression, support vector machines (SVM), decision trees, random forests, k-nearest neighbors (KNN), and neural networks can be applied to this dataset to build classifiers.
  2. Feature Selection: Researchers and practitioners often use this dataset to explore feature selection techniques. They may experiment with different methods to identify the most informative features for predicting breast cancer, which can lead to more efficient models and insights into the underlying factors contributing to cancer diagnosis.
  3. Model Evaluation and Comparison: The dataset serves as a benchmark for evaluating the performance of different machine learning algorithms. Practitioners can compare the accuracy, precision, recall, F1-score, and other metrics of classifiers trained on this dataset to determine which algorithms perform best for breast cancer diagnosis.
  4. Hyperparameter Tuning: Machine learning models typically have hyperparameters that need to be optimized for better performance. Practitioners can use the Breast Cancer Wisconsin dataset to tune hyperparameters using techniques such as grid search or randomized search to improve model accuracy and generalization.
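
The classification, evaluation, and tuning steps listed above can be combined in a few lines. The snippet below is a minimal sketch, not a recommended clinical model: the choice of logistic regression, the grid of C values, the 75/25 split, and random_state=0 are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set so both classes keep their proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fit on the training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Small illustrative grid over the regularization strength C
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)

# Evaluate the refitted best model on the held-out test set
y_pred = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```

Swapping LogisticRegression for another estimator (for example, a random forest or an SVM) and adjusting param_grid accordingly allows the same skeleton to be used for comparing algorithms.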

FAQ on Breast Cancer Wisconsin (Diagnostic) Dataset

What is the Breast Cancer Wisconsin (Diagnostic) dataset?

The Breast Cancer Wisconsin (Diagnostic) dataset is a collection of data describing breast masses, both malignant and benign. It contains features computed from digitized images of fine needle aspirates (FNA) of those masses.

What is the purpose of the dataset?

The dataset is commonly used for binary classification tasks, where the goal is to predict whether a tumor is malignant (cancerous) or benign (non-cancerous) based on the provided features.

What are the features in the dataset?

The dataset contains 30 numeric, predictive attributes derived from the FNA images. They cover ten measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), each reported as a mean, a standard error, and a "worst" value.

How many instances are there in the dataset?

The dataset consists of 569 instances, each representing a different breast mass: 212 malignant and 357 benign.

What is the format of the target variable?

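The target variable represents the diagnosis of the tumor and is binary. The original data labels the two classes M (malignant) and B (benign). In the copy loaded by sklearn, the target is encoded as integers: 0 for malignant and 1 for benign, with the class names available in target_names.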
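
In scikit-learn's copy of the dataset, the target is stored as integer codes rather than letters. The snippet below is a minimal sketch of translating those codes back to class names and to M/B letters; the `name_to_letter` mapping is a hypothetical helper defined only for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[0] is "malignant" and target_names[1] is "benign"
code_to_name = {code: str(name) for code, name in enumerate(data.target_names)}

# Hypothetical helper mapping, used only here, to recover M/B letters
name_to_letter = {"malignant": "M", "benign": "B"}

diagnosis = pd.Series(data.target).map(code_to_name).map(name_to_letter)
letter_counts = diagnosis.value_counts().to_dict()

print(code_to_name)
print(letter_counts)
```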
