Google Cloud Platform – Woking with External Data in BigQuery
In BigQuery it’s also possible to query data stored externally or outside BigQuery. In this article, we’re diving into these external data sources. It’s possible to leave your data in any place and use BigQuery as your query engine. These sources are called external or federated data sources. This functionality is currently supported for data residing in Google Drive, Cloud Storage, Cloud SQL, and Bigtable.
Before we look closer at how to query these sources, let’s discuss a few notable differences you’ll experience with external data sources:
- First to note is that the query performance for external data sources may not equal the query performance of data stored in BigQuery. So if query speed is a priority in your case, you may want to load the data directly into BigQuery.
- Secondly, when querying an external source, BigQuery cannot predict the amount of data to be processed. So you will only know after you run your query.
- Finally, the results are not cached like they would be when querying data stored in BigQuery. Caching is a great way to save costs and improve performance on repeated queries when the underlying data hasn’t changed.
Overall, this feature is best for short-term, less frequently accessed data. For example, you could use external data sources to support loading and transforming your data in just a single pass. In this workflow, you query the external source, transform the data as part of the query, and then write the results as a permanent table in BigQuery storage.
Another use case is for joining small amounts of frequently changing data with data stored in BigQuery. By keeping the frequently changing data as an external data source, it does not need to be reloaded into BigQuery every time it is updated. An example here is querying data that lives in a Google Sheet. Even as the Sheet is edited in real-time, you can run queries over the data and the results will reflect the live, up-to-date information.
So now let’s look at how to set up external data sources in BigQuery. In this example, we’ll run a query over a collection of JSON files located in a Cloud storage bucket. These files have the flight performance data for all domestic flights in the United States from 2014.
Step 1: Starting in the console, create a new data set. Highlight your project name in the left-hand nav and click Create Data Set. Name the data set flight_performance, choose the US for the location, and click Create Data Set.
Step 2: Now highlight your new data set and click Create Table.
Under Source, choose Google Cloud Storage.
Note: If your data was located in Google Drive, such as in a Google Sheet or in Bigtable, you would choose those alternate sources in the dropdown.
The flight performance data is located in a public bucket which any GCP user can access with the URI address. Paste the URI into the GCS bucket field. You’ll see there’s a wild card character in the URI, which indicates to BigQuery to include all the JSON files that adhere to the specific naming convention. Next, under Destination, make sure you set the table type to external. Name the table 2014. And you can auto-detect the schema in this case. Finally, click Create Table.
Since there is no data ingestion involved, you’ll see the table immediately populate under the data set. Look at the table details where you can see the external configuration that you just set up and table size of 0 bytes because your external table does not use any BigQuery storage.
Now you can run a query that references the external table. In our query we’re selecting account of all flights by the carrier. As we discussed earlier, you cannot see the amount of data processed until after the query completes. In this example, you created a permanent for the external data source. However, you can also query an external source using a temporary table which is useful for one-time ad hoc queries or for ETL processes.