
What is synthetic data?

In data science, synthetic data refers to artificially generated data that replicates the statistical characteristics and patterns of real-world data. It serves various purposes in data analysis, machine learning, and deep learning: it enables researchers and data scientists to conduct experiments, test algorithms, and develop models without exposing sensitive or private information. Synthetic data is created with algorithms and mathematical models that simulate the complexities found in real datasets. It can also be used to augment existing datasets, especially when the available data is limited or biased. Furthermore, it facilitates the assessment of model robustness, generalization, and performance under various scenarios.

What is Synthetic Data in Machine Learning?

In machine learning, synthetic data is artificially created data, as opposed to data gathered from real sources. It mimics the statistical characteristics of authentic data, aiding model training and testing when real data is limited or sensitive. Techniques such as data augmentation or generative models produce synthetic data, enhancing model robustness and performance. Despite its usefulness, ensuring that synthetic data accurately represents real-world scenarios is crucial for effective model generalization.



How is synthetic data generated?

Synthetic data is created using algorithms and statistical models that analyze real-world data to identify its underlying patterns and distributions. These patterns are then used to generate new data points that resemble the real data but do not contain any of the original information.
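For intuition, here is a minimal sketch of that idea, assuming the "real" data is a simple one-dimensional numeric sample (the values below are hypothetical placeholders): the code estimates the sample's mean and standard deviation and then draws new points from a normal distribution with those parameters.

import numpy as np

# A small "real" sample (hypothetical toy values)
real_data = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2])

# Learn the underlying pattern: here, simply the mean and standard deviation
mu, sigma = real_data.mean(), real_data.std()

# Draw new synthetic points that follow the same distribution
# but contain none of the original values
synthetic_points = np.random.normal(mu, sigma, size=100)
print(synthetic_points[:5])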

The figure below illustrates that synthetic data retains the same structure as the original data while the individual values differ.

Fig. 1: Structure of synthetic data compared with the original data

Synthetic data can be generated with a variety of techniques and methods; each technique modifies specific data characteristics and is chosen based on the application's requirements.

Measures of synthetic data

Measuring the accuracy of synthetic data is important to ensure its effectiveness in machine learning applications. Some common methods for measuring the accuracy of synthetic data are listed below; a short code sketch follows the list.

  1. Chi-square test: measures the difference between the observed and expected frequencies of values in the synthetic and real datasets. Lower chi-square values indicate higher accuracy.
  2. Kernel Density Estimation: compares the probability density functions of synthetic and real data. A closer match in shapes indicates greater accuracy.
  3. Mean Squared Error (MSE): computes the average squared difference between corresponding data points in the synthetic and real datasets. Lower MSE values indicate greater similarity.
  4. Wasserstein Distance: measures the distance between probability distributions by calculating the minimum “cost” of turning one distribution into the other. A lower Wasserstein distance signifies higher accuracy.
  5. Kolmogorov-Smirnov statistic: compares the cumulative distribution functions (CDFs) of the synthetic and real data. A smaller statistic suggests better similarity.
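As a rough illustration of how some of these measures can be computed in practice, here is a sketch assuming SciPy is installed and both datasets are one-dimensional numeric arrays (the samples below are hypothetical).

import numpy as np
from scipy import stats

# Hypothetical one-dimensional real and synthetic samples
real = np.random.normal(loc=0.0, scale=1.0, size=1000)
synthetic = np.random.normal(loc=0.1, scale=1.1, size=1000)

# Mean Squared Error between the sorted samples (equal sample sizes assumed)
mse = np.mean((np.sort(real) - np.sort(synthetic)) ** 2)

# Wasserstein distance between the two empirical distributions
wd = stats.wasserstein_distance(real, synthetic)

# Kolmogorov-Smirnov statistic comparing the two empirical CDFs
ks_stat, ks_pvalue = stats.ks_2samp(real, synthetic)

print(f"MSE: {mse:.4f}  Wasserstein: {wd:.4f}  KS: {ks_stat:.4f}")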

Synthetic Data Generation Using Python

A number of techniques can be used to generate synthetic data depending on the specific use case. Below, we implement some of the most common approaches using Python libraries.

1. Generating synthetic data using the Faker Python library

The Faker library in Python is used to generate realistic, randomized fake data for various purposes such as testing, populating databases, or creating sample datasets. It provides a simple and customizable way to generate fake names, addresses, emails, and other types of data to simulate real-world scenarios in a controlled and privacy-preserving manner.

Step 1: Install the Faker library using the command:

!pip install faker

Step 2: Load the Faker library and generate artificial personal information about people.




import pandas as pd
from faker import Faker
 
fake = Faker()
 
# Generate a synthetic DataFrame with columns like name, email, and job title of people
data = {'Name': [fake.name() for _ in range(100)],
        'Email': [fake.email() for _ in range(100)],
        'Job': [fake.job() for _ in range(100)]}
 
df = pd.DataFrame(data)
print(df.head())

Output:

           Name                    Email                          Job
0  Jill Morales     kemptina@example.org          Animal nutritionist
1   Jimmy Lynch     nicole46@example.com  Nature conservation officer
2   Rachel Dean    kenneth32@example.org            Buyer, industrial
3    Corey Reid   brandilong@example.net         Electronics engineer
4  Mark Ramirez  lisaperkins@example.com               Science writer

2. Generating synthetic data using the Scikit-learn library

Scikit-learn is a powerful machine learning library. It can be used to generate synthetic datasets with specific characteristics, especially for classification problems.

Step 1: Install the scikit-learn library using the command:

!pip install scikit-learn

Step 2: Load the scikit-learn library and generate artificial data for a classification problem.




import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
 
# Generate synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_classes=2, random_state=42)
 
# Create a DataFrame from the NumPy arrays
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(data=X, columns=columns)
df['target'] = y
 
# Print the first few rows of the DataFrame
df.head()

Output:

   feature_0  feature_1  feature_2  feature_3  feature_4  target
0  -0.065300  -0.717214   0.393952  -0.934473   1.681514       0
1   0.567015  -0.044606   1.612851  -1.350174   2.488878       0
2  -0.247215  -0.650569  -0.743500  -1.214190   0.841110       0
3   1.145870   0.974224   1.562506  -2.277010   2.276521       1
4   0.599605  -0.427545   2.374472  -1.503510   3.604959       0
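As a brief follow-up sketch (assuming scikit-learn is already installed and reusing the X and y arrays from the snippet above), the synthetic classification data can be used directly to train and evaluate a model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the synthetic data (X, y from the snippet above) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple classifier on the synthetic training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out synthetic test set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))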

3. Generating synthetic data using bootstrap sampling with the NumPy library

Bootstrap sampling is a resampling technique that involves randomly drawing samples with replacement from a dataset. It can be implemented with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Load the NumPy library and generate artificial integer data.




import numpy as np
import pandas as pd
 
# Original dataset
original_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 
# Bootstrap sampling function
def bootstrap_sample(data, num_samples=1000):
    synthetic_data = np.random.choice(data, size=(num_samples, len(data)), replace=True)
    return synthetic_data
 
# Generate synthetic data
synthetic_data = bootstrap_sample(original_data)
 
# Create a DataFrame from the synthetic data
num_columns = synthetic_data.shape[1]
column_names = [f'Column_{i+1}' for i in range(num_columns)]
synthetic_df = pd.DataFrame(synthetic_data, columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
synthetic_df.head()

Output:

   Column_1  Column_2  Column_3  Column_4  Column_5  Column_6  Column_7  \
0         9         8         6         9         5         2        10
1         9         7         6         5        10         9         8
2         6         3         1         3         1        10         1
3         8        10         4         3         4         3         9
4         4         6         6         6        10         9         2

   Column_8  Column_9  Column_10
0         2         6          2
1         7         6          2
2         4         1          5
3         8         1          4
4         4         4          6
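A common use of such bootstrap samples is to estimate the sampling distribution of a statistic. As a short sketch (reusing synthetic_data and the NumPy import from the snippet above), the mean of each resampled row approximates how the sample mean varies:

# Mean of each bootstrap sample (one value per resampled row)
bootstrap_means = synthetic_data.mean(axis=1)

# Empirical estimate of the sampling distribution of the mean
print("Estimated mean:", bootstrap_means.mean())
print("Estimated standard error:", bootstrap_means.std())
print("Approximate 95% interval:", np.percentile(bootstrap_means, [2.5, 97.5]))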

4. Generating synthetic data using a Gaussian statistical model with the NumPy library

Gaussian statistical models generate synthetic data by assuming a normal distribution defined by mean and standard deviation parameters. This can be done with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Load the NumPy library and generate artificial Gaussian-distributed data.




import numpy as np
import pandas as pd
 
# Parameters for the normal distribution (mean and standard deviation)
means = [5, 10, 15, 20, 25]
std_devs = [2, 2, 2, 2, 2]
 
# Generate synthetic data points using random sampling from a gaussian distribution for each input variable
num_samples = 1000
synthetic_data = np.random.normal(means, std_devs, size=(num_samples, len(means)))
 
# Create a DataFrame from the synthetic data
column_names = [f'Input_{i+1}' for i in range(len(means))] + ['Output']
synthetic_df = pd.DataFrame(data=np.hstack([synthetic_data, synthetic_data.sum(axis=1, keepdims=True)]),
                            columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
synthetic_df.head()

Output:

    Input_1    Input_2    Input_3    Input_4    Input_5     Output
0  4.780425  13.162034  16.231576  21.169664  26.686617  82.030316
1  7.178791   8.808152  14.307585  14.125770  25.779506  70.199805
2  5.036429   9.160681  13.157213  20.185858  25.250525  72.790706
3  6.982617   8.553333  12.821710  20.714012  24.872681  73.944353
4  3.843489   9.966772  14.924961  20.377723  29.385274  78.498219

Applications of synthetic data in machine learning

Synthetic data plays a crucial role in several aspects of the machine learning process, some of which are listed below.

  1. It supports data augmentation by expanding existing datasets, helping improve model performance through additional, diverse training examples (a small sketch of this idea follows the list).
  2. Synthetic data aids anomaly detection by creating realistic outliers and anomalies, allowing machine learning models to better identify and handle unexpected patterns in real-world data.
  3. Synthetic data can reduce biases present in real data, promoting fairness and reducing algorithmic bias in machine learning models.
  4. Synthetic data supplements real-world datasets, providing additional examples for model training, especially when obtaining sufficient authentic data is challenging.
  5. Synthetic data is valuable for simulating rare or extreme events, allowing machine learning models to be trained effectively on scenarios that occur infrequently in real-world data.
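As a small sketch of point 1, one simple augmentation strategy is to perturb existing numeric examples with Gaussian noise; this is a hypothetical, minimal approach, and real augmentation pipelines are usually domain-specific.

import numpy as np

# Hypothetical existing feature matrix (rows are training examples)
X_original = np.random.rand(100, 5)

# Create augmented copies by adding small Gaussian noise to each example
noise = np.random.normal(loc=0.0, scale=0.05, size=X_original.shape)
X_augmented = X_original + noise

# Combine original and augmented examples into a larger training set
X_combined = np.vstack([X_original, X_augmented])
print(X_combined.shape)  # (200, 5)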

Limitations of synthetic data

While synthetic data offers numerous advantages in machine learning, it also has certain limitations that need to be considered, which are listed below.

  1. Synthetic data may lead to overfitting, where the model performs well on the synthetic data but poorly on real-world data. This is because the synthetic data may not adequately represent the full range of real-world data variations.
  2. Creating accurate and realistic synthetic data often requires domain expertise to understand the underlying patterns and relationships in the real data.
  3. Generating high-quality synthetic data can be computationally expensive, especially for complex data types like images or natural language.
  4. While synthetic data can protect privacy by not using real-world data, there is still a risk of re-identifying individuals based on the synthetic data, especially if it contains sensitive attributes.

Conclusion

In conclusion, synthetic data proves to be a valuable asset in machine learning, offering solutions to data scarcity, privacy concerns, and biased datasets. Through generation methods such as random sampling, bootstrapping, and advanced techniques like GANs, synthetic data replicates the statistical characteristics of real-world data, enabling researchers and practitioners to conduct experiments, test algorithms, and develop models without compromising sensitive information. While synthetic data brings cost-effectiveness, bias mitigation, and privacy preservation, it is not without limitations: careful consideration of its potential for overfitting, the need for domain expertise during generation, computational costs, and re-identification risks is essential.

Frequently Asked Questions on Synthetic Data

1. Why do we need synthetic data?

Synthetic data is essential when obtaining real-world data is difficult or risky due to privacy concerns. It provides a substitute for training machine learning models, ensuring robust performance in scenarios with limited or sensitive data.

2. Can synthetic data replace real data?

Synthetic data cannot completely replace real data in all situations. While it has advantages, such as privacy preservation and addressing data scarcity, synthetic data may not fully capture the complexity and variability of real-world scenarios.

3. When was synthetic data invented?

In 1993, statistician Donald Rubin proposed the concept of “synthetic data” as a way to protect privacy in statistical analysis. He introduced the idea of generating artificial data that preserves the statistical properties of real data without revealing any confidential information.

4. Who uses synthetic data?

Synthetic data is utilized by data scientists for model training, especially in scenarios involving privacy concerns or limited datasets. Industries such as healthcare, finance, and autonomous vehicles leverage synthetic data to develop and test algorithms without compromising sensitive information.

5. What is the difference between original data and synthetic data?

Original data comes from real-world observations, capturing authentic information with potential privacy concerns. Synthetic data is artificially generated, mimicking real data patterns but lacking full authenticity.

6. How many types of synthetic data are there?

Synthetic data can be generated through methods like random sampling, parametric models, and neural networks like Generative Adversarial Networks (GANs). Rule-based systems, copulas, and domain-specific simulators are additional approaches, providing diverse options for creating artificial datasets in various applications.

