
How to Create a Custom Synthetic Dataset in R

Last Updated : 12 Apr, 2024

Synthetic datasets in R Programming Language are artificially generated data that mimic the structure and statistical behavior of real data. Because they act like real datasets, you can use them to test code and study analysis techniques without collecting anything. Here we'll show how to make your own synthetic datasets using R. It's easy and gives you the freedom to experiment with data in exciting new ways.

What is Synthetic Data?

Synthetic data is artificially generated information that behaves like real data. Instead of gathering it from the real world, we create it with computer programs or statistical rules. It's useful because it lets us study and test things without involving real people or sensitive records. This makes it a handy tool for researchers and companies who want to explore ideas and build algorithms without privacy concerns or data limitations.

Features of Synthetic Data

  1. Privacy Protection: Synthetic data keeps personal information safe because it’s made up, and not collected from real people.
  2. Data Augmentation: It adds more data to existing sets, which is handy when there’s not enough real data for training models.
  3. Diverse Scenarios: Synthetic data creates different situations, helping test models in various conditions.
  4. Cost-Effective: It saves money because you don’t need to collect real data, which can be expensive.
  5. Risk Reduction: Since the data is artificial, the risk of data breaches or legal issues is greatly reduced.
  6. Testing Algorithms: It’s great for trying out and improving algorithms without using real data.

Creating Synthetic Dataset in R

We will generate random values to create a synthetic dataset in R Programming Language.

Step 1: Define Variables

Determine the variables in the dataset and their characteristics such as data type, range, and distribution.

R
n <- 1000  # Number of observations
mean_age <- 40  # Mean age
sd_age <- 10  # Standard deviation of age
min_salary <- 20000  # Minimum salary
max_salary <- 80000  # Maximum salary

We define the variables and parameters for our synthetic dataset, including the number of observations, mean and standard deviation of age, and minimum and maximum salary.

Step 2: Generate Data

Use functions to generate data for each variable based on the defined characteristics.

R
# Set seed for reproducibility
set.seed(123)  
# Generate ages and round to whole numbers
age <- round(rnorm(n, mean = mean_age, sd = sd_age))  
# Generate salaries
salary <- runif(n, min = min_salary, max = max_salary)  

Generate synthetic data for ‘age’ and ‘salary’ variables using the rnorm() function for age (normal distribution) and the runif() function for salary (uniform distribution).

We round the generated ages to the nearest whole number using the round() function.

Step 3: Combine Data

Assemble the generated data into a dataframe or any other suitable data structure.

R
synthetic_data <- data.frame(age, salary)

We combine the generated data into a dataframe called ‘synthetic_data’ using the data.frame() function.
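As an optional sanity check, we can confirm the dataframe has the expected shape and that the simulated values roughly match the parameters we chose (the exact figures depend on the random seed):

```r
# Recreate the dataset from the earlier steps, then inspect it
set.seed(123)
n <- 1000
age <- round(rnorm(n, mean = 40, sd = 10))
salary <- runif(n, min = 20000, max = 80000)
synthetic_data <- data.frame(age, salary)

str(synthetic_data)      # 1000 obs. of 2 numeric variables: age, salary
summary(synthetic_data)  # mean of age should sit near 40, salary within 20000-80000
```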

Step 4: Adding Noise or Randomness

If desired, add noise or randomness to the generated data to make it more realistic.

R
# Standard deviation of the noise
noise_sd <- 5000  
synthetic_data$salary <- synthetic_data$salary + rnorm(n, sd = noise_sd)

We add noise to the 'salary' variable to introduce variability.

  • We specify the standard deviation of the noise (noise_sd) and use the rnorm() function to generate random noise with that standard deviation, which is then added to the ‘salary’ variable.
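
One detail worth knowing: because the noise is normally distributed, a few noisy salaries can end up outside the original 20,000 to 80,000 range. If your use case requires the values to stay in range, an optional refinement (not part of the steps above) is to clamp them back with pmin() and pmax():

```r
# Regenerate the salaries and add noise, as in the earlier steps
set.seed(123)
n <- 1000
min_salary <- 20000
max_salary <- 80000
salary <- runif(n, min = min_salary, max = max_salary)
noisy_salary <- salary + rnorm(n, sd = 5000)

# Clamp each value into [min_salary, max_salary]
clamped_salary <- pmax(pmin(noisy_salary, max_salary), min_salary)
range(clamped_salary)  # guaranteed to lie within the intended bounds
```

Clamping keeps the realism of the added noise while preventing impossible values such as negative salaries.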

Step 5: Checking the Head of the Synthetic Dataset

R
head(synthetic_data)

Output:

  age   salary
1  34 25475.51
2  38 27134.66
3  56 24440.33
4  41 54001.40
5  41 55171.41
6  57 67616.63
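
Real datasets usually mix numeric and categorical columns. As an illustrative extension (the column name and category labels below are made up for the example), the sample() function can generate a categorical variable with chosen proportions:

```r
set.seed(123)
n <- 1000
# Hypothetical categorical column: department, drawn with specified probabilities
department <- sample(c("Sales", "Engineering", "HR"), size = n,
                     replace = TRUE, prob = c(0.5, 0.35, 0.15))
table(department)  # counts should roughly follow the 50/35/15 split
```

Such a column can be added to the dataframe the same way as the numeric ones, e.g. synthetic_data$department <- department.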

Create a Custom Synthetic Dataset of Study Hours Vs Exam Scores

R
# Step 1: Generate Synthetic Data
set.seed(123)  # Set seed for reproducibility
study_hours <- round(runif(100, min = 1, max = 10)) 
exam_scores <- round(70 + study_hours * 5 + rnorm(100, sd = 5))  

# Step 2: Analyze the Data
correlation <- cor(study_hours, exam_scores)  
lm_model <- lm(exam_scores ~ study_hours)  

# Step 3: Visualize the Results
plot(study_hours, exam_scores, main = "Study Hours vs. Exam Scores", 
          xlab = "Study Hours", ylab = "Exam Scores")
abline(lm_model, col = "red")  # Add regression line
text(5, 90, paste("Correlation:", round(correlation, 2)), col = "blue") 

Output:

[Plot: Study Hours vs. Exam Scores scatter plot with red regression line and correlation label]

We generate synthetic data for study hours and exam scores, assuming a linear relationship between them.

  • Calculate the correlation between study hours and exam scores to measure their association.
  • Fit a linear regression model to examine how study hours predict exam scores.
  • Then visualize the relationship between study hours and exam scores using a scatter plot with a regression line.
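
To check that the fitted model recovers the relationship we built into the data, we can inspect its coefficients. Because the scores were generated as roughly 70 + 5 × hours plus noise, the intercept and slope estimates should land near 70 and 5 (exact values vary with the seed):

```r
# Regenerate the study-hours data and refit the model
set.seed(123)
study_hours <- round(runif(100, min = 1, max = 10))
exam_scores <- round(70 + study_hours * 5 + rnorm(100, sd = 5))

lm_model <- lm(exam_scores ~ study_hours)
coef(lm_model)               # intercept near 70, slope near 5
summary(lm_model)$r.squared  # high, since the underlying relationship is linear
```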

Limitations of Synthetic Datasets

  1. Limited Real-World Representation: Synthetic datasets may not capture the full complexity and variability of real-world data.
  2. Potential Bias: The generation process can introduce biases if it doesn’t accurately reflect the true characteristics of the population.
  3. Lack of Context: Synthetic datasets often lack contextual information present in real-world data, impacting their usefulness for analysis.
  4. Limited Generalizability: Models trained on synthetic data may not perform well on real-world data due to differences in distribution or underlying patterns.
  5. Validation Challenges: It can be difficult to validate models trained on synthetic data without real-world testing opportunities.

Conclusion

Synthetic datasets are helpful for exploring different scenarios and relationships in data analysis. However, they’re not perfect copies of real-world data. They might miss some details, have biases, or be challenging to validate. It’s essential to use them carefully, alongside real data when possible.


