# Generate Test Datasets for Machine Learning

Last Updated : 11 Apr, 2023

Whenever we think of machine learning, the first thing that comes to mind is a dataset. Many datasets are available on websites such as Kaggle, but it is sometimes useful to generate your own. A generated dataset gives you control over properties such as the number of samples, the number of features, and the amount of noise, which makes it convenient for testing a machine-learning model. In this article, we will generate random datasets using the sklearn.datasets module in Python.

### Generate test datasets for Classification:

#### Binary Classification

Example 1: The 2D binary classification data generated by make_circles() consist of a large circle containing a smaller one, so the two classes are separated by a circular decision boundary.

```python
# Import necessary libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_circles(n_samples=200, shuffle=True,
                    noise=0.1, random_state=42)

# Plot the generated dataset
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Output:

make_circles()
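To make the circular structure concrete, the following minimal sketch generates the same kind of data without noise and measures each point's distance from the origin. The value factor=0.5 is an illustrative choice, not one used in the example above. It relies on the behavior described in the scikit-learn documentation, where label 0 marks the outer circle and label 1 the inner one.

```python
import numpy as np
from sklearn.datasets import make_circles

# No noise, so every point lies exactly on one of the two circles.
# factor=0.5 (an illustrative value) sets the inner radius relative to the outer one.
X, y = make_circles(n_samples=200, factor=0.5, random_state=42)

# Distance of every sample from the origin
radius = np.linalg.norm(X, axis=1)
outer_radius = radius[y == 0]  # label 0: outer circle
inner_radius = radius[y == 1]  # label 1: inner circle

print("outer circle radius range:", outer_radius.min(), outer_radius.max())
print("inner circle radius range:", inner_radius.min(), inner_radius.max())
```

Any circle with a radius between the two values separates the classes, which is why a linear classifier struggles on this dataset.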

Example 2: The make_moons() function produces 2D binary classification data shaped as two interleaving half circles.

```python
# Import necessary libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_moons(n_samples=500, shuffle=True,
                  noise=0.15, random_state=42)

# Plot the generated dataset
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Output:

make_moons()
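Generated data also make it easy to control the class balance. As a small sketch, assuming scikit-learn 0.23 or later (where n_samples may be a two-element tuple), the sizes of the two moons can be set separately. The 400/100 split is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_moons

# A two-element tuple sets the number of points in each moon
# (assumes scikit-learn >= 0.23; the 400/100 split is illustrative).
X, y = make_moons(n_samples=(400, 100), noise=0.15, random_state=42)

# Count how many samples carry each label
class_counts = np.bincount(y)
print("samples per class:", class_counts)
```

An imbalanced set like this can be useful for checking how a classifier or a metric behaves when one class dominates.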

#### Multi-Class Classification

Example 1: The make_blobs() function generates isotropic Gaussian blobs of points. Each blob has its own label, so the data can be used for multi-class classification as well as for clustering.

```python
# Import necessary libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2, random_state=23)

# Plot the generated dataset
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Output:

make_blobs()
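Because make_blobs() returns the true blob label of every point, it offers a simple way to check a clustering algorithm. The sketch below is one possible setup, not part of the original example: the explicit centers and cluster_std=0.5 are hypothetical values chosen so that the blobs are well separated, and KMeans with the adjusted Rand index is just one reasonable choice of algorithm and metric.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical, well-separated centers chosen for illustration
centers = [(-5, -5), (0, 5), (5, -5)]
X, y = make_blobs(n_samples=500, centers=centers,
                  cluster_std=0.5, random_state=23)

# Cluster the points without looking at the true labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predicted = kmeans.fit_predict(X)

# Compare the discovered clusters with the true blob membership;
# the adjusted Rand index ignores how the cluster ids are numbered.
score = adjusted_rand_score(y, predicted)
print("adjusted Rand index:", score)
```

A score close to 1 means the clusters match the generated blobs. Increasing cluster_std or moving the centers closer together makes the task harder.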

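Example 2: The make_classification() function generates a random n-class classification problem. Its arguments must be consistent with each other: n_informative + n_redundant + n_repeated cannot exceed n_features, and n_classes * n_clusters_per_class cannot exceed 2 ** n_informative. When shuffle=False, all the useful (non-noise) features are in the columns X[:, :n_informative + n_redundant + n_repeated].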

```python
# Import necessary libraries
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate 2d classification dataset
X, y = make_classification(n_samples=100,
                           n_features=2,
                           n_redundant=0,
                           n_informative=2,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=1)

# Plot the generated dataset
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```

Output:

make_classification()
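The column layout and the argument constraint of make_classification() can be checked directly. The sketch below uses illustrative sizes (6 features: 2 informative, 2 redundant, 1 repeated, 1 noise) and shuffle=False, so the columns should stay in the documented order. It then deliberately requests more classes than a single informative feature can support to show the resulting error.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the columns in generation order:
# informative, redundant, repeated, then noise (sizes are illustrative).
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, n_redundant=2,
                           n_repeated=1, n_classes=3,
                           n_clusters_per_class=1,
                           shuffle=False, random_state=0)

informative = X[:, :2]
redundant = X[:, 2:4]
repeated = X[:, 4]

# Redundant columns should be linear combinations of the informative ones,
# so a least-squares fit should reproduce them exactly.
weights, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
redundant_is_linear = np.allclose(informative @ weights, redundant)

# The repeated column should duplicate one of the first four columns.
repeated_matches = [j for j in range(4) if np.allclose(repeated, X[:, j])]

print("redundant columns are linear combinations:", redundant_is_linear)
print("repeated column duplicates column(s):", repeated_matches)

# 3 classes * 1 cluster per class exceeds 2 ** 1, so this call should fail.
try:
    make_classification(n_features=2, n_informative=1, n_redundant=0,
                        n_repeated=0, n_classes=3, n_clusters_per_class=1)
    constraint_error = None
except ValueError as exc:
    constraint_error = str(exc)

print("error for inconsistent arguments:", constraint_error)
```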

Example 3: The make_multilabel_classification() function creates a random multi-label classification dataset, in which each sample can belong to several classes at once.

```python
# Import necessary libraries
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import matplotlib.pyplot as plt

# Generate 2d multi-label classification dataset
X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# Create a pandas DataFrame from the generated dataset
df = pd.concat([pd.DataFrame(X, columns=['X1', 'X2']),
                pd.DataFrame(y, columns=['Label1', 'Label2'])],
               axis=1)

# print() works in a plain script as well as in a notebook
print(df.head())

# Plot the generated dataset, colored by the first label
plt.scatter(df['X1'], df['X2'], c=df['Label1'])
plt.show()
```

Output:

```
     X1    X2  Label1  Label2
0  14.0  34.0       0       1
1  30.0  22.0       1       1
2  29.0  19.0       1       1
3  21.0  19.0       1       1
4  16.0  32.0       0       1
```

make_multilabel_classification()
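By default the labels come back as a binary indicator matrix with one column per class, so a row can contain several ones or, with allow_unlabeled=True, none at all. The following sketch, which reuses the arguments from the example above, counts how many labels each sample received. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)

# y has one column per class; entry (i, j) is 1 if sample i has label j
labels_per_sample = y.sum(axis=1)

# Number of samples with 0, 1 and 2 labels respectively
label_count_histogram = np.bincount(labels_per_sample, minlength=3)

print("X shape:", X.shape, "y shape:", y.shape)
print("values in y:", np.unique(y))
print("samples with 0, 1, 2 labels:", label_count_histogram)
```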

### Generate test datasets for Regression:

Example 1: Generate a one-dimensional feature and a target for linear regression using make_regression().

```python
# Import necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate 1d regression dataset
X, y = make_regression(n_samples=50, n_features=1, noise=20, random_state=23)

# Plot the generated dataset
plt.scatter(X, y)
plt.show()
```

Output:

make_regression()
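A generated regression problem has a known ground truth, which helps when checking an estimator. As a minimal sketch, passing coef=True makes make_regression() also return the coefficient of the underlying linear model, which can then be compared with a fitted slope. LinearRegression is used here only as an example estimator.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the true coefficient used to build y
X, y, true_coef = make_regression(n_samples=50, n_features=1, noise=20,
                                  coef=True, random_state=23)

# Fit an ordinary least-squares line to the noisy data
model = LinearRegression().fit(X, y)
estimated_slope = float(model.coef_[0])

print("true slope:     ", float(true_coef))
print("estimated slope:", estimated_slope)
```

With noise=20 and only 50 samples, the estimate should be close to the true slope but not identical. Lowering the noise or adding samples narrows the gap.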

Example 2: The make_sparse_uncorrelated() function generates uncorrelated, normally distributed features and a target that depends linearly on only the first four of them.

```python
# Import necessary libraries
from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt

# Generate regression dataset with 4 features
X, y = make_sparse_uncorrelated(n_samples=100, n_features=4, random_state=23)

# Plot the target against each feature
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
```

Output:

make_sparse_uncorrelated()
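According to the scikit-learn documentation, make_sparse_uncorrelated() builds the target as X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3] plus standard normal noise. The sketch below fits a linear model and compares the estimated coefficients with those values. It uses 1000 samples rather than 100 (an illustrative choice) so that the estimates are tighter.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import LinearRegression

# More samples than in the example above, to reduce estimation error
X, y = make_sparse_uncorrelated(n_samples=1000, n_features=4,
                                random_state=23)

# Coefficients documented for the first four features
expected_coef = np.array([1.0, 2.0, -2.0, -1.5])

# Ordinary least squares should recover them approximately
model = LinearRegression().fit(X, y)

print("expected: ", expected_coef)
print("estimated:", np.round(model.coef_, 2))
```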

Example 3: The make_friedman2() function generates the "Friedman #2" regression problem, which has four independent, uniformly distributed features and a target that is a nonlinear function of them.

```python
# Import necessary libraries
from sklearn.datasets import make_friedman2
import matplotlib.pyplot as plt

# Generate regression dataset with 4 features
X, y = make_friedman2(n_samples=100, random_state=23)

# Plot the target against each feature
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
```

Output:

make_friedman2()
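The scikit-learn documentation gives the Friedman #2 target as the square root of X[:, 0] ** 2 + (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) ** 2, plus optional Gaussian noise. Since the noise parameter defaults to 0, the formula can be checked against the generated target, as in this short sketch.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# noise defaults to 0.0, so y is a deterministic function of X
X, y = make_friedman2(n_samples=100, random_state=23)

# Recompute the target from the documented formula
inner_term = X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])
y_from_formula = np.sqrt(X[:, 0] ** 2 + inner_term ** 2)

formula_matches = np.allclose(y, y_from_formula)
print("formula reproduces y:", formula_matches)
```

Passing a positive noise value adds Gaussian noise on top of this function, which gives a more realistic benchmark for nonlinear regressors.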
