ChatGPT Prompt to Get Datasets for Machine Learning

With the growth of machine learning, access to high-quality datasets is becoming increasingly important. Datasets are a prerequisite for any machine learning project: they are what you train the model on and what you use to assess its accuracy and effectiveness. In this article, we'll learn how to use template prompts with ChatGPT (OpenAI) to gather a variety of datasets for different machine learning applications, calling the API from Python.

Steps for Generating Datasets Using ChatGPT

Step 1: Install the OpenAI library in Python

!pip install -q openai  # the "!" prefix is for Jupyter/Colab cells; omit it in a regular shell

Step 2: Import the OpenAI library in Python

Python3

import openai


Step 3: Assign your API key to the openai library

Python3

openai.api_key = "YOUR_API_KEY"
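
Hardcoding the key works for a quick experiment, but a safer pattern is to read it from an environment variable. A minimal sketch (the variable name OPENAI_API_KEY is a common convention, not a requirement):

Python3

import os

# Read the key from the environment so it never appears in source code
openai.api_key = os.environ.get("OPENAI_API_KEY")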


Step 4: Create a custom function to call the ChatGPT API

Python3

def chat(message):
    # Send a single user message to the Chat Completions endpoint
    # (openai<1.0 interface, which was current when this article was written)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": message},
        ]
    )
    # Return the text of the first reply
    return response['choices'][0]['message']['content']


Step 5: Call the function and pass in a prompt

res = chat('Message')
print(res)

Prompts to Gather/Generate Datasets for Machine Learning

Prompt 1:

Create a list of datasets that can be used to train {topic} models. Ensure that the datasets are available in CSV format. The objective is to use these datasets to learn about {topic}. Also, provide links to the datasets if possible. Create the list in tabular form with the following columns: Dataset name, dataset, URL, dataset description

Python3

prompt ='''
Create a list of datasets that can be used to train logistic regression models. 
Ensure that the datasets are available in CSV format. 
The objective is to use this dataset to learn about logistic regression models 
and related nuances such as training the models. Also provide links to the dataset if possible.
Create the list in tabular form with following columns:
Dataset name, dataset, URL, dataset description
'''
res = chat(prompt)
print(res)


Output:

| Dataset name | Dataset | URL | Dataset description |
| --- | --- | --- | --- |
| Titanic - Machine Learning from Disaster | titanic.csv | https://www.kaggle.com/c/titanic/data | Contains data on passengers of the Titanic, including features such as age, sex, and class, along with whether they survived or not. |
| Red Wine Quality | winequality-red.csv | https://archive.ics.uci.edu/ml/datasets/wine+quality | Contains data on various physiochemical properties of red wine and their associated quality ratings. |
| Bank Marketing | bank-additional-full.csv | https://archive.ics.uci.edu/ml/datasets/bank+marketing | Contains information on a bank's telemarketing campaign, including contact details of customers and whether they subscribed to a term deposit or not. |
| Breast Cancer Wisconsin (Diagnostic) | wdbc.csv | https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) | Contains data on various features extracted from digitized images of breast cancer biopsies, along with whether the biopsy was benign or malignant. |
| Adult | adult.csv | https://archive.ics.uci.edu/ml/datasets/Adult | Contains demographic data on individuals, along with whether their income exceeds a certain threshold or not. |
| Heart Disease | heart.csv | https://www.kaggle.com/ronitf/heart-disease-uci | Contains data on various medical measurements taken on individuals, along with whether they have heart disease or not. |
| Pima Indians Diabetes | pima-indians-diabetes.csv | https://www.kaggle.com/uciml/pima-indians-diabetes-database | Contains data on various medical measurements taken on Pima Indian women, along with whether they have diabetes or not. |
| Iris | iris.csv | https://archive.ics.uci.edu/ml/datasets/Iris | Contains data on various measurements taken on iris flowers, along with their species. |
| Loan Prediction | train.csv | https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#ProblemStatement | Contains various demographic data on loan applicants, along with whether their application was approved or not. |
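
Once you have downloaded one of the listed CSV files, you can load it with pandas to start experimenting. A minimal sketch (it assumes winequality-red.csv from the UCI link above is in the working directory; note that the UCI wine-quality files are semicolon-separated):

Python3

import pandas as pd

# The UCI wine-quality CSVs use ";" as the field separator
df = pd.read_csv("winequality-red.csv", sep=";")

# Quick sanity checks before building a model
print(df.shape)
print(df.head())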

Prompt 2:

Generate a dummy dataset to train and test a {machine learning model name} for educational purposes.

Python3

res = chat('generate a dummy dataset to train and test a logistic regression model \
for educational purposes. Ensure that the dataset is available in csv format')
print(res)


Output:

Here is an example dummy dataset in CSV format for educational purposes:
```
Age,Gender,Income,Education,Employment_Status,Marital_Status,Loan_Approval
23,Male,25000,High School,Unemployed,Single,Not Approved
32,Female,45000,Bachelor's Degree,Employed,Married,Not Approved
45,Male,120000,Master's Degree,Employed,Married,Approved
38,Female,60000,Bachelor's Degree,Employed,Married,Approved
26,Male,32000,Associate's Degree,Employed,Single,Not Approved
29,Female,28000,High School,Employed,Single,Not Approved
41,Male,80000,Doctoral Degree,Employed,Divorced,Approved
54,Male,95000,Master's Degree,Employed,Married,Approved
```
The dataset contains demographic and financial information for eight individuals along with whether or not they were approved for a loan. The goal is to train a logistic regression model to predict loan approval based on the other variables.
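
To actually use such a dummy dataset, you can save the CSV block above to a file and fit a model with scikit-learn. A minimal sketch (the filename dummy_loans.csv is an assumption; eight rows is far too small for real modeling and serves only to illustrate the workflow):

Python3

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the generated CSV (assumes it was saved as "dummy_loans.csv")
df = pd.read_csv("dummy_loans.csv")

# One-hot encode the categorical features; Loan_Approval is the target
X = pd.get_dummies(df.drop(columns=["Loan_Approval"]))
y = df["Loan_Approval"]

# Split, fit, and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))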

Prompt 3:

List down datasets to practice {topic}. If possible, also attach dataset links and descriptions. Create the list in tabular format

Python3

prompt ='''
List down datasets to practice object detection, 
if possible also attach dataset links and description. 
Create the list in tabular format
'''
res = chat(prompt)
print(res)


Output:

| Dataset         | Link                                                      | Description                                                           |
| :-------------- | :-------------------------------------------------------- | :-------------------------------------------------------------------- |
| COCO            | http://cocodataset.org/#home                               | Common Objects in Context dataset, contains over 330K images         |
| Pascal VOC      | http://host.robots.ox.ac.uk/pascal/VOC/                    | Pascal Visual Object Classes dataset, contains 20 object categories   |
| Open Images     | https://storage.googleapis.com/openimages/web/index.html   | Contains over 9M images with object-level annotations                 |
| ImageNet        | http://www.image-net.org/                                  | Large-scale dataset with over 14M annotated images and 21k categories |
| KITTI           | http://www.cvlibs.net/datasets/kitti/                      | Contains images of street scenes with object-level annotations        |
| BDD100K         | https://bdd-data.berkeley.edu/                              | Large-scale diverse dataset for autonomous driving                    |
| DOTA            | https://captain-whu.github.io/DOTA/index.html              | Large-scale aerial images dataset with object detection annotations   |
| WIDER FACE      | http://shuoyang1213.me/WIDERFACE/                          | Contains 32k images of faces with bounding box annotations            |
| VisDrone        | http://www.aiskyeye.com/                                   | Contains 10k images with annotations of various objects               |
| MS COCO Text    | http://www.robots.ox.ac.uk/~vgg/data/scenetext/             | Contains 63k images with text annotations                             |
These datasets can be used with popular object detection frameworks such as TensorFlow, PyTorch, and Keras.
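
As an example of loading one of these datasets, torchvision provides a CocoDetection wrapper for COCO-style annotations (it requires pycocotools; the paths below are placeholders for wherever you unpacked the download):

Python3

from torchvision.datasets import CocoDetection

# Point these at your local copy of COCO (paths are placeholders)
dataset = CocoDetection(
    root="coco/val2017",                                # image directory
    annFile="coco/annotations/instances_val2017.json",  # COCO-format annotations
)

image, targets = dataset[0]  # a PIL image and a list of annotation dicts
print(len(dataset), "images;", len(targets), "objects in the first image")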

Prompt 4:

Create a list of datasets for practicing on {topic}. Make sure they are available in {format} format. Also, provide links to the datasets.

Python3

prompt ="""
Create a list of datasets for practicing on machine translation from english to hindi. 
Make sure they are available in text format. 
Also, provide links to the dataset.
"""
res = chat(prompt)
print(res)


Output:


1. TED Talks Corpus: This dataset contains parallel transcripts of TED talks in English and Hindi. It is available in text format and can be downloaded from the official website: https://www.ted.com/participate/translate
2. United Nations Parallel Corpus: This corpus contains parallel texts in Hindi and English from speeches delivered by UN delegates. It is available in text format and can be downloaded from the official website: https://conferences.unite.un.org/UN/corpus
3. OPUS Corpus: This corpus contains parallel texts in various languages including Hindi and English. It includes data from a wide range of domains such as news, legal documents, and subtitles. It is available in text format and can be downloaded from the official website: http://opus.nlpl.eu/
4. Bible Corpus: This dataset contains parallel texts of the Bible in Hindi and English. It is available in text format and can be downloaded from the official website: http://christos-c.com/bible_data/
5. Indian Language Parallel Corpus: This corpus contains parallel texts in Hindi and other Indian languages. It includes data from various domains such as news, novels, and Wikipedia articles. It is available in text format and can be downloaded from the official repository: https://github.com/AI4Bharat/indic-corpus
6. Covid-19 India Parallel Corpus: This corpus contains parallel texts in Hindi and English related to the Covid-19 pandemic in India. It includes data from news sources, government advisories, and social media. It is available in text format and can be downloaded from the official website: https://github.com/AI4Bharat/covid19-news/blob/master/parallel-corpus.md
7. BookCorpus: This dataset contains parallel texts of novels in Hindi and English. It is available in text format and can be downloaded from the official website: https://github.com/soskek/bookcorpus/tree/master/data
Note: Some of these datasets may require some preprocessing and cleaning before using for machine translation purposes.
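
Parallel corpora like these are typically distributed as line-aligned plain-text files, one sentence per line per language. A minimal sketch of pairing them up (the filenames train.en and train.hi are placeholders; actual names vary by corpus):

Python3

# Pair up two line-aligned files into (English, Hindi) sentence tuples
with open("train.en", encoding="utf-8") as f_en, \
     open("train.hi", encoding="utf-8") as f_hi:
    pairs = [(en.strip(), hi.strip()) for en, hi in zip(f_en, f_hi)]

print(len(pairs), "sentence pairs")
print(pairs[0])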

Frequently Asked Questions (FAQs)

1. What are the best prompts for data analysis in ChatGPT?

You can enter a text prompt and provide data to ChatGPT, though it is not possible to pass an entire dataset to ChatGPT through the API. For analyzing a full dataset, you can use PandasAI, which connects an LLM to a pandas DataFrame.
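
PandasAI's interface has changed significantly across releases; the sketch below follows its early (2023-era) interface and should be checked against the version you install:

Python3

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

# A small DataFrame to analyse
df = pd.DataFrame({
    "country": ["USA", "UK", "India"],
    "gdp": [21.4, 2.8, 2.9],
})

# Wire an OpenAI LLM into PandasAI and ask a question about the data
llm = OpenAI(api_token="YOUR_API_KEY")
pandas_ai = PandasAI(llm)
print(pandas_ai.run(df, prompt="Which country has the highest gdp?"))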

2. How do I download datasets for machine learning?

You can open the links provided by ChatGPT and download the CSV files from the website.
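
If a link points directly at a raw CSV file, you can also fetch it from Python. A minimal sketch (the direct-file URL below is an assumption based on UCI's usual layout; verify it on the dataset page):

Python3

import urllib.request

# Direct link to the raw CSV on the UCI server (verify on the dataset page)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
urllib.request.urlretrieve(url, "winequality-red.csv")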

3. How do I set up a prompt?

While setting up a prompt, keep your sentences short and concise. Clearly state the task you want ChatGPT to perform, and add a brief explanation for complex tasks.

Conclusion:

In this article, we have seen how to use the ChatGPT API (OpenAI) in Python to gather or generate datasets for practicing machine learning algorithms. ChatGPT is a convenient one-stop resource for dataset gathering and generation, among many other applications. Keep in mind, though, that the generated datasets are dummy or practice datasets meant for learning purposes only.


