What is Google Dataset Search and How to Use It?
Are you a Data Scientist trying to find details about the job market in the US? You can find Datasets on the US job market or on global job listings. Are you a biologist studying DNA? You can find Datasets on the human DNA sequence, DNA repair rates, and more. Or are you just a cat lover? Even then, you can find Datasets on cats per household or Datasets of cute cat images. In other words, if you want data on almost any topic, you can probably find a Dataset for it on the internet. Even for cats!
One of the best ways to find these Datasets is Google Dataset Search, which provides a single platform for searching many Datasets so that you can find your data in one place. In this article, you will get to know more about Google Dataset Search and how to find Datasets on it. But first, let’s address the most fundamental question, “What is a Dataset?”, so that there are no doubts while moving on.
What is a Dataset?
Simply put, a Dataset is a collection of data! For a more precise explanation, a Dataset can be a single database table, a collection of tables, a data matrix, etc. In a tabular Dataset, each column corresponds to a data variable and each row is one instance (record) of the data. Now, are you wondering why Datasets are even important?
Datasets are essential in Data Science and Machine Learning. If your Dataset is not good enough, the Machine Learning model will fail no matter how good the use case or your data scientists are! In fact, Datasets are used all through ML project development, from training the ML model to tuning it and then testing it. The three Datasets used are the training set, the validation set, and the test set. The Training Dataset is used to fit the ML model (for example, an Artificial Neural Network) so that it learns to produce the desired output. This Dataset contains both the input data and the output that is expected from the model. The Validation Dataset is used during development to tune the model, for example to choose hyperparameters or to compare candidate models, using data that the model was not trained on. Finally, the Test Dataset is used to check how well the final model performs on data it has never seen. The Test Dataset contains input data together with outputs that have been verified to be correct, usually by humans, and it should not be used for any training or tuning.
Now you have seen how important Datasets are for Data Science and Machine Learning. In fact, without a Dataset, there is nothing for a Machine Learning algorithm to learn from! Therefore, it is very important to have good-quality, reliable Datasets for training ML models. But where can you find these Datasets? This is where Google Dataset Search comes in! Let’s understand what that is now.
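As a rough illustration of how one Dataset is divided into training, validation, and test sets, here is a minimal sketch in plain Python. The function name split_dataset, the 70/15/15 proportions, and the toy data are assumptions chosen for this example; real projects often use different ratios or a library helper.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    The fractions are illustrative defaults, not a recommendation.
    Whatever is left after the training and validation parts becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical toy Dataset: (input, expected_output) pairs.
rows = [(x, 2 * x) for x in range(100)]
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

The three parts never overlap, so the model is tuned and evaluated only on rows it was not trained on.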
What is Google Dataset Search?
Many governments and private organizations around the world publish their data online. In fact, the United States has over 2 million open government Datasets available for people to access and use. And Google Dataset Search helps you find these Datasets!
Google Dataset Search is a version of Google’s search engine that is used specifically to search for Datasets from all over the world in fields such as machine learning, social sciences, government data, geosciences, biology, life sciences, agriculture, etc. According to Google, Dataset Search has indexed around 25 million Datasets that you can search to obtain useful data. Google also believes that Dataset Search will help create a data-sharing ecosystem in which governments and private companies that have Datasets can share them using best practices for data storage and publication. Most openly available Datasets are described using schema.org, which is an open standard for structured metadata. Many of these Datasets can be downloaded and used for research, business analytics, training an ML model, etc., although the exact terms depend on the license of each Dataset.
Google Dataset Search uses schema.org and other metadata standards to find these Datasets and show them in its search results. If you publish a Dataset of your own, you can make it discoverable in Google Dataset Search by adding a schema.org description to the web page that hosts it.
Google Dataset Search also gives some guidance on what can qualify as a Dataset. This includes:
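To make the idea of a schema.org description more concrete, here is a minimal sketch that builds one as JSON-LD using Python. The property names (name, description, url, license, distribution, encodingFormat, contentUrl) come from the schema.org Dataset and DataDownload types; the helper build_dataset_markup and all example.org values are hypothetical placeholders, and a real page may need more properties than shown here.

```python
import json

def build_dataset_markup(name, description, url, license_url,
                         download_url, encoding_format):
    """Return a schema.org Dataset description as a Python dict (JSON-LD)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }

# All values below are made-up placeholders for illustration.
markup = build_dataset_markup(
    name="Cats per household (example)",
    description="A hypothetical survey of the number of cats per household.",
    url="https://example.org/datasets/cats-per-household",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/cats-per-household.csv",
    encoding_format="text/csv",
)

json_ld = json.dumps(markup, indent=2)

# JSON-LD is usually embedded in the HTML of the page that hosts the Dataset.
print('<script type="application/ld+json">')
print(json_ld)
print("</script>")
```

The printed script element would go into the HTML of the Dataset's landing page, where a crawler can read it.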
- A table which contains data
- A collection of tables in an organized form
- A file that contains data in a proprietary format
- A collection of files in an organized form that creates a Dataset
- Images which capture some form of data
- Files with Machine Learning trained parameters or neural network structure definitions
- Anything that is not on this list but looks like a Dataset to you
How to search Datasets on Google Dataset Search?
Searching for Datasets on Google Dataset Search is as simple as searching for anything on Google Search! You just enter the topic on which you need a Dataset in Google Dataset Search and click Search. For example, if you want to find Datasets on COVID-19, just type in “COVID 19” and search away. You will get the most relevant Datasets relating to COVID-19, and you can also filter your search based on when the Datasets were last updated, what their download format is, whether they are allowed for commercial usage, whether they are free, etc.
As you can see in this screenshot, the first Dataset in the search is provided by the World Health Organization and contains both images and tabular data on the spread of COVID-19 around the globe.
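If you prefer to start the same COVID-19 search from a script, the sketch below builds a search link and prints it. Note that the base address and the query parameter name are assumptions based on how the public Dataset Search page appears to form its links, not a documented API, so they may change; dataset_search_url is a hypothetical helper written for this example.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: address of the public Dataset Search page and its "query"
# parameter. This is not an official API and may change.
BASE_URL = "https://datasetsearch.research.google.com/search"

def dataset_search_url(topic):
    """Build a link that opens Google Dataset Search for the given topic."""
    if not topic.strip():
        raise ValueError("topic must not be empty")
    # urlencode takes care of spaces and special characters in the topic.
    return f"{BASE_URL}?{urlencode({'query': topic})}"

url = dataset_search_url("COVID 19")
print(url)

# Reading the link back shows that the topic survives the encoding step.
print(parse_qs(urlparse(url).query)["query"])
```

You can paste the printed link into a browser, or pass it to Python's webbrowser.open to open it directly.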
Google Dataset Search also allows you to easily find public Datasets published by different governments on topics such as the national population census, national financial reports, weather reports, and other statistics. You can use these Datasets for research, business analytics, completing your thesis, and so on. For example, if you want to find Datasets related to the government of Canada, you can type “Canada government” and search away! You will get various Datasets that are available through Google Dataset Search and related to the Canadian government.
As you can see in the screenshot, the first Dataset in the search contains all the consultations submitted by departments and agencies in the Government of Canada. The second Dataset is the Government of Canada Employee Contact Information, and so on.
Another important thing to mention about Google Dataset Search is that you can view all the scholarly articles on Google Scholar that cite a Dataset or are otherwise connected to it. As you can see in the above screenshot, a link is provided to the 12 scholarly articles that cite the Government of Canada – Consultations Dataset. On clicking this link, you can see all of these scholarly articles on Google Scholar.