Airflow tutorial 6: Build a data pipeline using Google Cloud BigQuery
In this tutorial, we will build a data pipeline using Google Cloud BigQuery and Airflow.
A quick look at this tutorial
This tutorial is inspired by this blog post from the official Google Cloud blogs.
We will be using 2 public datasets hosted on Google BigQuery:
- GitHub Archive: 30 million events monthly, including issues, commits, and pushes on GitHub.
- Hacker News: a full daily update of all the stories and comments from Hacker News.
From these public datasets, we will create a data pipeline that aggregates daily stats about GitHub repos and their associated Hacker News stories, and pushes the result into a new combined table every day. The final table powers this visualization report: a GitHub on Hacker News trends dashboard built with Google Data Studio.
Overview of Google BigQuery
Google BigQuery is a big data analytics product from Google that helps you run ad-hoc analysis on massive datasets using Google Cloud infrastructure. With the power of BigQuery, you can run a query that analyzes terabytes of data within seconds.
Some characteristics of Google BigQuery:
- Serverless: an execution model in which the cloud provider (e.g. AWS, Google) dynamically allocates the resources needed to run your code and charges you only for the resources actually used.
- Highly scalable: can scale to tens of thousands of machines in Google's data centers.
- High performance: massively parallel execution and automatic performance optimization.
- High Availability: Automatically replicates data between zones to enable high availability. It also automatically load balances to provide optimal performance and to minimize the impact of any hardware failures.
The advantages of choosing Google BigQuery for this tutorial:
- Access to Google BigQuery public datasets: Google pays for the storage of these datasets and provides public access to the data via your cloud project. You pay only for the queries that you perform on the data. Moreover, there’s a 1 TB per month free tier for queries, making getting started super easy (a quick cost-check sketch follows this list).
- Real-time analysis of massive datasets
- No-ops (no operations): no need to do operations work like setting up or maintaining any infrastructure. Just focus on your code.
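For example, you can check how much data a query would scan, and therefore how much of the 1 TB/month free tier it would use, before actually running it. Here is a minimal sketch using the google-cloud-bigquery Python client in dry-run mode; the project ID is a placeholder, and the GitHub Archive table and date are just examples:

from google.cloud import bigquery

# Assumes you have authenticated, e.g. via
# `gcloud auth application-default login` or GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client(project="your-project-id")  # placeholder project ID

# Dry run: BigQuery validates the query and reports the bytes it would scan,
# without running it or incurring query charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT type, COUNT(*) AS cnt
FROM `githubarchive.day.20181201`
GROUP BY type
ORDER BY cnt DESC
"""

job = client.query(query, job_config=job_config)
print("This query would process {:.2f} GB".format(job.total_bytes_processed / 1e9))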
Working with Google BigQuery
- Sign in to your Google account
- Search for Google Cloud and sign up
- Go to the Google Cloud Console
- Search for BigQuery in the navigation bar.
In this tutorial, we will be working with the two public datasets introduced above: GitHub Archive and Hacker News.
Exploratory data analysis
The first step when working with any new dataset is some exploratory analysis to get familiar with the data.
You can either:
- use the BigQuery web UI to run your ad-hoc queries, or
- go through my Jupyter notebook to reproduce my analysis (a minimal query sketch also follows this list).
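If you prefer to explore from Python rather than the web UI, here is a minimal sketch (not taken from the notebook; the project ID and date are placeholders) that pulls the top-scoring Hacker News stories for a given day with the google-cloud-bigquery client:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# Example exploration: top-scoring Hacker News stories on a given day,
# from the public bigquery-public-data.hacker_news.full table.
query = """
SELECT title, url, score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
  AND DATE(timestamp) = '2018-12-01'
ORDER BY score DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.score, row.title)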
Steps to run my Jupyter notebook Docker environment and reproduce the analysis:
- Clone this airflow-tutorial repo
- Go to the notebooks directory; you should see a docker-compose.yml file there:
cd notebooks
- Run this command in the terminal to start the docker environment
docker-compose up
- You should see output similar to the following in your terminal, containing the first-time login token:
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
http://(7b568d3eba0f or 127.0.0.1):8888/?token=7d229945745ecd113adc572d50d93c98e3552516ee65a432
- Copy the token, open a new browser tab, and enter:
localhost:8889/?token=7d229945745ecd113adc572d50d93c98e3552516ee65a432
- That’s it. You should now see the work/github-trend-analysis.ipynb file.
Building the data pipeline
Setting up
In order to run this data pipeline example, you need to set up a few things first. Specifically:
- Create a service account (Cloud Console)
- Set up a Google Cloud Connection in Airflow
- Supply the config variables
Follow these instructions to set up and run your DAG.
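As an illustration of what “supply the config variables” means in practice, the DAG can read its settings from Airflow Variables and refer to the Google Cloud connection by its conn ID. The variable keys and connection ID below are placeholders; use whatever names you configured in the Airflow UI:

from airflow.models import Variable

# Placeholders: use the Variable keys and connection ID you actually created
# under Admin -> Variables and Admin -> Connections in the Airflow UI.
BQ_CONN_ID = "my_gcp_conn"
BQ_PROJECT = Variable.get("bq_project")
BQ_DATASET = Variable.get("bq_dataset")

These values can then be interpolated into the SQL and passed to the BigQuery operators in the DAG.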
Running the Airflow docker environment
I have already created a Docker environment to run this data pipeline example. Steps to run the Airflow environment:
- Check out the master branch of this tutorial's GitHub repo
- Start the Airflow environment with docker
bash run_gcloud_example.sh
- Stop the Airflow environment when you are finished
bash stop_gcloud_example.sh
Test your DAG
After the connection and config variables have been set up using the instructions above, you can test and run your DAG.
- Use the command below to test a specific task in the DAG (a sketch of one such check task follows the examples):
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test [DAG_ID] [TASK_ID] [EXECUTION_DATE]
- Examples:
# Task 1
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test bigquery_github_trends bq_check_githubarchive_day 2018-12-01
# Task 2
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test bigquery_github_trends bq_check_hackernews_full 2018-12-01
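For context, here is a minimal sketch of how a check task like bq_check_githubarchive_day could look in Airflow 1.x. This is not the exact DAG from the repo, just an illustration of using BigQueryCheckOperator to verify that yesterday's GitHub Archive daily table exists and has rows before the downstream aggregation runs; the connection ID and start date are placeholders:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 12, 1),  # placeholder start date
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "bigquery_github_trends",
    default_args=default_args,
    schedule_interval="@daily",
)

# Fail fast if yesterday's GitHub Archive daily table is missing or empty.
bq_check_githubarchive_day = BigQueryCheckOperator(
    task_id="bq_check_githubarchive_day",
    sql="""
    SELECT COUNT(*)
    FROM `githubarchive.day.{{ yesterday_ds_nodash }}`
    """,
    use_legacy_sql=False,
    bigquery_conn_id="my_gcp_conn",  # placeholder connection ID
    dag=dag,
)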