
What is synthetic data?

In data science, synthetic data refers to artificially generated data that replicates the statistical characteristics and patterns of real-world data. It serves various purposes in data analysis, machine learning, and deep learning: it enables researchers and data scientists to conduct experiments, test algorithms, and develop models without exposing sensitive or private information. Synthetic data is created with algorithms and mathematical models that simulate the complexities found in real datasets. It can also be used to augment existing datasets, especially when the available data is limited or biased. Furthermore, it facilitates the assessment of model robustness, generalization, and performance under various scenarios.

What is Synthetic Data in Machine Learning?

In machine learning, synthetic data is artificially created data, as opposed to data gathered from real sources. It mimics the statistical characteristics of authentic data, aiding model training and testing when real data is limited or sensitive. Techniques such as data augmentation or generative models produce synthetic data, enhancing model robustness and performance. Despite its usefulness, ensuring that synthetic data accurately represents real-world scenarios is crucial for effective model generalization.



How is synthetic data generated?

Synthetic data is created using algorithms and statistical models that analyze real-world data to identify its underlying patterns and distributions. These patterns are then used to generate new data points that resemble the real data but do not contain any of the original information.
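For intuition, here is a minimal sketch of that idea, assuming the "real" data is a simple one-dimensional numeric sample (the values below are hypothetical placeholders): the code estimates the sample's mean and standard deviation and then draws new points from a normal distribution with those parameters.

import numpy as np

# A small "real" sample (hypothetical toy values)
real_data = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2])

# Learn the underlying pattern: here, simply the mean and standard deviation
mu, sigma = real_data.mean(), real_data.std()

# Draw new synthetic points that follow the same distribution
# but contain none of the original values
synthetic_points = np.random.normal(mu, sigma, size=100)
print(synthetic_points[:5])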

The figure below illustrates that synthetic data retains the same structure as the original data while the individual values differ.

Fig. 1: Structure of synthetic data compared with the original data

Synthetic data can be generated with a variety of techniques and methods; each technique modifies specific data characteristics and is chosen based on the application's requirements.

Measures of synthetic data

Measuring the accuracy of synthetic data is important to ensure its effectiveness in machine learning applications. Some common methods for measuring the accuracy of synthetic data are listed below; a short code sketch follows the list.

  1. Chi-square test: measures the difference between the observed and expected frequencies of values in the synthetic and real datasets. Lower chi-square values indicate higher accuracy.
  2. Kernel Density Estimation: compares the probability density functions of synthetic and real data. A closer match in shapes indicates greater accuracy.
  3. Mean Squared Error (MSE): computes the average squared difference between corresponding data points in the synthetic and real datasets. Lower MSE values indicate greater similarity.
  4. Wasserstein Distance: measures the distance between probability distributions by calculating the minimum “cost” of turning one distribution into the other. A lower Wasserstein distance signifies higher accuracy.
  5. Kolmogorov-Smirnov statistic: compares the cumulative distribution functions (CDFs) of the synthetic and real data. A smaller statistic suggests better similarity.
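As a rough illustration of how some of these measures can be computed in practice, here is a sketch assuming SciPy is installed and both datasets are one-dimensional numeric arrays (the samples below are hypothetical).

import numpy as np
from scipy import stats

# Hypothetical one-dimensional real and synthetic samples
real = np.random.normal(loc=0.0, scale=1.0, size=1000)
synthetic = np.random.normal(loc=0.1, scale=1.1, size=1000)

# Mean Squared Error between the sorted samples (equal sample sizes assumed)
mse = np.mean((np.sort(real) - np.sort(synthetic)) ** 2)

# Wasserstein distance between the two empirical distributions
wd = stats.wasserstein_distance(real, synthetic)

# Kolmogorov-Smirnov statistic comparing the two empirical CDFs
ks_stat, ks_pvalue = stats.ks_2samp(real, synthetic)

print(f"MSE: {mse:.4f}  Wasserstein: {wd:.4f}  KS: {ks_stat:.4f}")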

Synthetic Data Generation Using Python

A number of techniques can be used to generate synthetic data depending on the specific use case. Below, we implement some of the most common approaches using Python libraries.

1. Generating synthetic data using the Faker Python library

The Faker library in Python is used to generate realistic, randomized fake data for various purposes such as testing, populating databases, or creating sample datasets. It provides a simple and customizable way to generate fake names, addresses, emails, and other types of data to simulate real-world scenarios in a controlled and privacy-preserving manner.

Step 1: Install the Faker library using the command:

!pip install faker

Step 2: Load the Faker library and generate artificial personal information about people.




import pandas as pd
from faker import Faker
 
fake = Faker()
 
# Generate a synthetic DataFrame with columns like name, email, and job title of people
data = {'Name': [fake.name() for _ in range(100)],
        'Email': [fake.email() for _ in range(100)],
        'Job': [fake.job() for _ in range(100)]}
 
df = pd.DataFrame(data)
print(df.head())

Output:

           Name                    Email                          Job
0  Jill Morales     kemptina@example.org          Animal nutritionist
1   Jimmy Lynch     nicole46@example.com  Nature conservation officer
2   Rachel Dean    kenneth32@example.org            Buyer, industrial
3    Corey Reid   brandilong@example.net         Electronics engineer
4  Mark Ramirez  lisaperkins@example.com               Science writer

2. Generating synthetic data using the Scikit-learn library

Scikit-learn is a powerful machine learning library. It can be used to generate synthetic datasets with specific characteristics, especially for classification problems.

Step 1: Install the scikit-learn library using the command:

!pip install scikit-learn

Step 2: Load the scikit-learn library and generate artificial data for a classification problem.




import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
 
# Generate synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_classes=2, random_state=42)
 
# Create a DataFrame from the NumPy arrays
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(data=X, columns=columns)
df['target'] = y
 
# Print the first few rows of the DataFrame
df.head()

Output:

   feature_0  feature_1  feature_2  feature_3  feature_4  target
0  -0.065300  -0.717214   0.393952  -0.934473   1.681514       0
1   0.567015  -0.044606   1.612851  -1.350174   2.488878       0
2  -0.247215  -0.650569  -0.743500  -1.214190   0.841110       0
3   1.145870   0.974224   1.562506  -2.277010   2.276521       1
4   0.599605  -0.427545   2.374472  -1.503510   3.604959       0
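As a brief follow-up sketch (assuming scikit-learn is already installed and reusing the X and y arrays from the snippet above), the synthetic classification data can be used directly to train and evaluate a model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the synthetic data (X, y from the snippet above) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple classifier on the synthetic training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out synthetic test set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))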

3. Generating synthetic data using bootstrap sampling with the NumPy library

Bootstrap sampling is a resampling technique that involves randomly drawing samples with replacement from a dataset. It can be implemented with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Load the NumPy library and generate artificial integer data.




import numpy as np
import pandas as pd
 
# Original dataset
original_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 
# Bootstrap sampling function
def bootstrap_sample(data, num_samples=1000):
    synthetic_data = np.random.choice(data, size=(num_samples, len(data)), replace=True)
    return synthetic_data
 
# Generate synthetic data
synthetic_data = bootstrap_sample(original_data)
 
# Create a DataFrame from the synthetic data
num_columns = synthetic_data.shape[1]
column_names = [f'Column_{i+1}' for i in range(num_columns)]
synthetic_df = pd.DataFrame(synthetic_data, columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
synthetic_df.head()

Output:

   Column_1  Column_2  Column_3  Column_4  Column_5  Column_6  Column_7  \
0         9         8         6         9         5         2        10
1         9         7         6         5        10         9         8
2         6         3         1         3         1        10         1
3         8        10         4         3         4         3         9
4         4         6         6         6        10         9         2

   Column_8  Column_9  Column_10
0         2         6          2
1         7         6          2
2         4         1          5
3         8         1          4
4         4         4          6
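A common use of such bootstrap samples is to estimate the sampling distribution of a statistic. As a short sketch (reusing synthetic_data and the NumPy import from the snippet above), the mean of each resampled row approximates how the sample mean varies:

# Mean of each bootstrap sample (one value per resampled row)
bootstrap_means = synthetic_data.mean(axis=1)

# Empirical estimate of the sampling distribution of the mean
print("Estimated mean:", bootstrap_means.mean())
print("Estimated standard error:", bootstrap_means.std())
print("Approximate 95% interval:", np.percentile(bootstrap_means, [2.5, 97.5]))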

4. Generating synthetic data using a Gaussian statistical model with the NumPy library

Gaussian statistical models generate synthetic data by assuming a normal distribution defined by mean and standard deviation parameters. This can be done with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Load the NumPy library and generate artificial Gaussian-distributed data.




import numpy as np
import pandas as pd
 
# Parameters for the normal distribution (mean and standard deviation)
means = [5, 10, 15, 20, 25]
std_devs = [2, 2, 2, 2, 2]
 
# Generate synthetic data points using random sampling from a gaussian distribution for each input variable
num_samples = 1000
synthetic_data = np.random.normal(means, std_devs, size=(num_samples, len(means)))
 
# Create a DataFrame from the synthetic data
column_names = [f'Input_{i+1}' for i in range(len(means))] + ['Output']
synthetic_df = pd.DataFrame(data=np.hstack([synthetic_data, synthetic_data.sum(axis=1, keepdims=True)]),
                            columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
synthetic_df.head()

Output:

    Input_1    Input_2    Input_3    Input_4    Input_5     Output
0  4.780425  13.162034  16.231576  21.169664  26.686617  82.030316
1  7.178791   8.808152  14.307585  14.125770  25.779506  70.199805
2  5.036429   9.160681  13.157213  20.185858  25.250525  72.790706
3  6.982617   8.553333  12.821710  20.714012  24.872681  73.944353
4  3.843489   9.966772  14.924961  20.377723  29.385274  78.498219

Applications of synthetic data in machine learning

Synthetic data plays a crucial role in several aspects of the machine learning process, some of which are listed below.

  1. It supports data augmentation by expanding existing datasets, helping improve model performance through additional, diverse training examples (a small sketch of this idea follows the list).
  2. Synthetic data aids anomaly detection by creating realistic outliers and anomalies, allowing machine learning models to better identify and handle unexpected patterns in real-world data.
  3. Synthetic data can reduce biases present in real data, promoting fairness and reducing algorithmic bias in machine learning models.
  4. Synthetic data supplements real-world datasets, providing additional examples for model training, especially when obtaining sufficient authentic data is challenging.
  5. Synthetic data is valuable for simulating rare or extreme events, allowing machine learning models to be trained effectively on scenarios that occur infrequently in real-world data.
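As a small sketch of point 1, one simple augmentation strategy is to perturb existing numeric examples with Gaussian noise; this is a hypothetical, minimal approach, and real augmentation pipelines are usually domain-specific.

import numpy as np

# Hypothetical existing feature matrix (rows are training examples)
X_original = np.random.rand(100, 5)

# Create augmented copies by adding small Gaussian noise to each example
noise = np.random.normal(loc=0.0, scale=0.05, size=X_original.shape)
X_augmented = X_original + noise

# Combine original and augmented examples into a larger training set
X_combined = np.vstack([X_original, X_augmented])
print(X_combined.shape)  # (200, 5)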

Limitations of synthetic data

While synthetic data offers numerous advantages in machine learning, it also has certain limitations that need to be considered, which are listed below.

  1. Synthetic data may lead to overfitting, where the model performs well on the synthetic data but poorly on real-world data. This is because the synthetic data may not adequately represent the full range of real-world data variations.
  2. Creating accurate and realistic synthetic data often requires domain expertise to understand the underlying patterns and relationships in the real data.
  3. Generating high-quality synthetic data can be computationally expensive, especially for complex data types like images or natural language.
  4. While synthetic data can protect privacy by not using real-world data, there is still a risk of re-identifying individuals based on the synthetic data, especially if it contains sensitive attributes.

Conclusion

In conclusion, synthetic data proves to be a valuable asset in machine learning, offering solutions to data scarcity, privacy concerns, and biased datasets. Through generation methods such as random sampling, bootstrapping, and advanced techniques like GANs, synthetic data replicates the statistical characteristics of real-world data, enabling researchers and practitioners to conduct experiments, test algorithms, and develop models without compromising sensitive information. While synthetic data brings cost-effectiveness, bias mitigation, and privacy preservation, it is not without limitations: careful consideration of its potential for overfitting, the need for domain expertise during generation, computational costs, and re-identification risks is essential.

Frequently Asked Questions on Synthetic Data

1. Why do we need synthetic data?

Synthetic data is essential when obtaining real-world data is difficult or risky due to privacy concerns. It provides a substitute for training machine learning models, ensuring robust performance in scenarios with limited or sensitive data.

2. Can synthetic data replace real data?

Synthetic data cannot completely replace real data in all situations. While it has advantages, such as privacy preservation and addressing data scarcity, synthetic data may not fully capture the complexity and variability of real-world scenarios.

3. When was synthetic data invented?

In 1993, statistician Donald Rubin proposed the concept of “synthetic data” as a way to protect privacy in statistical analysis. He introduced the idea of generating artificial data that preserves the statistical properties of real data without revealing any confidential information.

4. Who uses synthetic data?

Synthetic data is utilized by data scientists for model training, especially in scenarios involving privacy concerns or limited datasets. Industries such as healthcare, finance, and autonomous vehicles leverage synthetic data to develop and test algorithms without compromising sensitive information.

5. What is the difference between original data and synthetic data?

Original data comes from real-world observations, capturing authentic information with potential privacy concerns. Synthetic data is artificially generated, mimicking real data patterns but lacking full authenticity.

6. How many types of synthetic data are there?

Synthetic data can be generated through methods like random sampling, parametric models, and neural networks like Generative Adversarial Networks (GANs). Rule-based systems, copulas, and domain-specific simulators are additional approaches, providing diverse options for creating artificial datasets in various applications.

