Open In App

Google Cloud Platform – Loading Data to BigQuery

Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we will look into how to load and analyze your own data in BigQuery.  As it is better to understand the concept with examples, we will be answering the age-old question “Which is better, cats or dogs?”

If you want to analyze data that are not already available as part of the public data sets program or hosted publicly by another BigQuery user, you’ll need to load your own data into BigQuery. How you load the data depends on your analytics needs and your data pipeline. If your data slowly changes or needs to be loaded in a one-time analysis, you may be fine with loading the data into BigQuery in batch. But if you need to ingest and analyze data close to real-time, in this case, you may need to stream your data into BigQuery.

BigQuery has options for loading data that cover both of these scenarios. So let’s get started by loading data in a batch. So what data do you need in BigQuery in order to officially crown the winner of the cats versus dogs battle? We’re going to determine the champion by analyzing college basketball tournament games in order to see who wins when dog mascots and cat mascots go head to head.

 To run this analysis, we’ll need two things. First, we’ll need to have the tournament results data, which is already available in BigQuery as part of the public NCAA basketball public data set. Second, we’ll need a list that has teams with dog and cat mascots, which we have available as a local CSV file. To join these two data sets together for analysis, we’ll need to load the mascot CSV file into BigQuery and create a table. 

 First, create a home for the mascot table in BigQuery. BigQuery organizes data into containers called data sets. These data sets function somewhat like top-level folders that manage underlying tables.

Now follow the below steps to load the data:

Step 1: To create a new data set, select the project name on the left-hand nav and click the Create Data Set button.

Step 2: Then we’ll give the data set a name and then decide on a location. In this case, we need to co-locate the data set with the NCAA public data set, which is located in the US multi-region. We’ll need to reference both tables in one query by performing a join. And this can only be done with tables residing in the same geographical location. You can always view the data set the location by clicking on the Details tab in the web UI. Click Create Data Set and the new data set will appear in the left-hand nav.

Step 3: Now it’s time to create a new table within the data set by loading the mascot CSV file. Highlight the data set and click Create Table. 

This dialog allows us to directly upload files from our local machine up to 10 megabytes in size and containing less than 16,000 rows. If you have something larger you can just upload it to Cloud storage and then select it from there. Since our CSV file is pretty small, we can skip that and use the browse functionality to select the file from our local machine.

Step 4: Give the table a name and then define the schema. The schema is a list of each column and its data type. We can define the schema manually by clicking on Add Field or check the box to have BigQuery auto-detect it.

Step 5: Click Create Table and a load job will be created. Once the data has finished loading, you can navigate to view the table details, review the schema, and preview the data right in the console. Our mascots table is ready to query. 

Step 6: We’ll paste in a query that uses the mascot table to analyze cat versus dog tournament game match-ups. In our query, we’ve started with our table of tournament games and then used a series of SQL joins with the mascots table to lookup an animal classification for the winning and losing teams. We then sum up the number of wins for cats and the number of wins for dogs in the specific cat versus dog match-up games. Our query will be as given below:

#standardSQL

WITH matchups AS {
SELECT g.win_team_id

,g.lose_team_id
,(SELECT win_masc.tax_genus FROM 'analytics-testing-321.basketball.mascots' win_masc WHERE win_masc.id = g.win_team_id) AS tax_genus_winner 
,(SELECT lose_masc.tax_family FROM 'analytics-testing-321-basketball.mascots' lose_masc WHERE lose_masc.id = g.lose_team_id) AS tax_family_loser
,(SELECT win_masc.tax_family FROM 'analytics-testing-321.basketball.mascots win_masc WHERE win_masc.id =g.win_tean_id) AS tax_family_winner
, (SELECT lose_masc.tax_genus FROM 'analytics-testing-321-basketball.mascots lose_masc WHERE lose_masc.id = g.lose_team_id) AS tax_genus_loser
 FROM 'bigquery-public-data.ncaa_basketball.mbb_historical_tournament_games' g
 )
SELECT
SUM(IF(tax_family_winner = "Felidae" AND tax_genus_loser = "Canis", 1, 0)) AS num_cat_wins,
SUM(IF(tax_genus_winner = "Canis" AND tax_family_loser = "Felidae", 1, 0)) AS num_dog_wins
FROM matchups

Step 7: Now run our query. And there we have it, with 43 of the wins, dog mascots win in the realm of college basketball tournament games. You can also see how dogs and cats perform on other metrics or join this data with other data sets to test new ideas. 


Last Updated : 30 Mar, 2023
Like Article
Save Article
Previous
Next
Share your thoughts in the comments
Similar Reads