Automating deployments makes it easier to replicate environments, which allows us to rapidly build and test data pipelines at scale. At Astrafy, we deploy our own Airflow environments and help our clients with theirs. That’s why we decided to automate this process with Terraform, making it possible to deploy Airflow simply by declaring a few variables and calling a Terraform module.
Deploying Airflow on Google Cloud Platform (GCP) using Terraform presents some advantages over other tools like Google Cloud Composer. In this article, we delve into the steps to accomplish this and explore the perks of this approach.
Cloud Composer comparison
You may ask why you would want to deploy Airflow yourself on Google Cloud when the platform already offers a managed service that does exactly that for you. We firmly believe that deploying your own instance of Airflow presents several advantages over the managed version.
However, this is not a silver bullet: if you want to get started with your data pipelines as soon as possible and would rather not manage infrastructure, a managed solution like Cloud Composer is definitely the way to go.
1. Cost-Effectiveness
Deploying Airflow on GKE is more cost-effective. The fixed costs of the managed version are already fairly high and increase from time to time. In addition, you have more granular control over the resources being provisioned, so you only pay for what you actually use.
2. Customization and Control
With Terraform, you have more control and flexibility over your environment: you can customize your Airflow deployment to suit your specific requirements and workflows.
3. Version Control
With Terraform, you can manage your infrastructure as code and version control it just like you would with application code. This capability makes it easier to manage and track changes to your infrastructure over time.
Official Documentation
If you want to know more about the deployment, you can check out our official documentation.
Steps to Deploy Airflow on GCP using Terraform
For this tutorial, we are going to make use of the following three repositories.
The first one is the Terraform module repository, which stores the Terraform files you can use yourself to deploy both the external resources Airflow needs and Airflow itself.
The second one contains the code we used to deploy Airflow for this blog.
The third one is the repository where the DAGs are located. Airflow will sync this repository and read the DAGs stored in it to run its pipelines.
Prerequisites
These prerequisites are resources external to Airflow that are not created by the module because, following best practices, they should be managed independently and may encompass more resources than Airflow alone.
In this tutorial, we are going to create them with Terraform code so you can see how to get started, but ideally they should be created elsewhere.
Kubernetes cluster (GKE)
Airflow will be running in a Kubernetes (k8s) cluster, so it is required to have one in place.
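To give a concrete starting point, here is a minimal sketch of such a cluster using the google_container_cluster resource. The names, region, and node count are placeholders, and the network it references is defined in the next prerequisite; the gke.tf in the example repository will set more options than this.

```hcl
# Minimal GKE cluster sketch (placeholder names, region, and sizing).
# The VPC and subnet it attaches to are defined in the next prerequisite.
resource "google_container_cluster" "airflow" {
  name     = "airflow-cluster"
  location = "europe-west1"

  network    = google_compute_network.airflow.id
  subnetwork = google_compute_subnetwork.airflow.id

  # A single small node pool is enough for testing; size it for your workloads.
  initial_node_count = 1
}
```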
Virtual Private Cloud (VPC)
The GKE cluster needs to be on the same network as the external database that will be created. For this reason, the best practice is to have a VPC in place that hosts the Kubernetes cluster. We will pass this network to the module’s network variable so that the external database lives on the same network and the two can communicate over private IPs.
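A minimal sketch of such a network, with placeholder names, region, and CIDR range (the network.tf in the example repository may differ):

```hcl
# Custom-mode VPC with one subnet for the GKE cluster.
resource "google_compute_network" "airflow" {
  name                    = "airflow-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "airflow" {
  name          = "airflow-subnet"
  region        = "europe-west1"
  network       = google_compute_network.airflow.id
  ip_cidr_range = "10.10.0.0/20"
}
```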
Global Address
In order to create the Cloud SQL database, a global address range in the same VPC is needed. This address consists of a range of private IP addresses that your Cloud SQL instance will use to deploy its resources. You can see the Terraform resource here.
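As a rough sketch, the reserved range looks like the resource below, together with the private services connection that Cloud SQL private IP typically relies on (names and prefix length are placeholders, and depending on your setup the connection may be managed elsewhere):

```hcl
# Reserved private IP range for Cloud SQL inside the VPC.
resource "google_compute_global_address" "sql_private_range" {
  name          = "airflow-sql-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.airflow.id
}

# Private services connection so Cloud SQL can allocate from that range.
resource "google_service_networking_connection" "private_vpc" {
  network                 = google_compute_network.airflow.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.sql_private_range.name]
}
```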
Resources created by the module
The Terraform module creates the following resources:
A Cloud SQL instance to be used as Airflow’s External Database
A GitHub repository deploy key so that Airflow can access the repository where the DAGs are located
Kubernetes secrets to be used by Airflow
- Webserver secret key
- Database credentials
- Fernet key
- GitHub deploy key (gitsync)
The GitHub deploy key is created in this case since we are going to use a GitHub repository to store the DAGs that will be synced with Airflow.
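You do not need to write any of this yourself, but to give an idea of what happens under the hood, a deploy key plus gitsync secret can be created with the TLS, GitHub, and Kubernetes providers roughly as shown below. The repository name, secret name, and namespace are illustrative and not necessarily what the module uses internally.

```hcl
# Key pair for gitsync: the public half is registered as a read-only deploy
# key on the DAGs repository, the private half goes into a Kubernetes secret.
resource "tls_private_key" "gitsync" {
  algorithm = "ED25519"
}

resource "github_repository_deploy_key" "airflow_dags" {
  repository = "airflow-dags" # illustrative repository name
  title      = "airflow-gitsync"
  key        = tls_private_key.gitsync.public_key_openssh
  read_only  = true
}

resource "kubernetes_secret" "gitsync_ssh_key" {
  metadata {
    name      = "airflow-gitsync-ssh-key" # illustrative secret name
    namespace = "airflow"
  }
  data = {
    # Key name referenced by the chart's dags.gitSync.sshKeySecret setting.
    gitSshKey = tls_private_key.gitsync.private_key_openssh
  }
}
```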
Optionally, the module can also deploy Airflow using the official Helm Chart. This is not recommended since it is better to use a deployment tool such as ArgoCD or Flux to deploy resources in Kubernetes. However, the functionality is included for ease of use and testing purposes. We will be using this option for the tutorial.
Deploying Airflow
The module is applied in this GitHub repository, where we also create all the prerequisites in the gke.tf and network.tf files. Then, we can create the database, secrets, and deploy keys simply by applying the module.
To use the module like this, we need to set the project_id and token variables in the terraform.auto.tfvars file. The token variable is the GitHub token used to authenticate to GitHub so that the module can access the repository you want to use as storage for your DAGs.
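As an illustration, the variable file and the module call end up looking something like the sketch below. The module source address and any variable names beyond project_id, token, and network are placeholders; check the module repository for the full interface.

```hcl
# terraform.auto.tfvars (placeholder values)
project_id = "my-gcp-project"
token      = "<github-token-with-access-to-the-dags-repo>"
```

```hcl
# Module call (the source address is a placeholder).
module "airflow" {
  source = "github.com/astrafy/terraform-google-airflow"

  project_id = var.project_id
  token      = var.token
  network    = google_compute_network.airflow.id
}
```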
The Airflow Helm chart needs to receive the specific values of our deployment; you can check those in the following values.yaml file.
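When the module’s built-in Helm option is enabled, it applies the chart with these values for us. If you were to wire the chart up yourself with the Terraform Helm provider, it would look roughly like this (release name and namespace are placeholders):

```hcl
# Install the official Airflow chart and feed it our values.yaml.
resource "helm_release" "airflow" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  namespace        = "airflow"
  create_namespace = true

  # Deployment-specific settings: gitsync repository, external database, etc.
  values = [file("${path.module}/values.yaml")]
}
```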
Then, we apply the module together with the extra resources described above; the plan shows 32 resources to create, including the database, the secrets, and so on.
After 16 minutes and 3 seconds, the apply finally finishes and we have our Airflow deployment ready to use.
To check that everything is working fine, we port-forward the web UI and inspect the DAGs.
Here we can access the web UI at localhost:8080.
You can now log in with the user that Airflow creates automatically: admin as both the username and the password.
Conclusion
If you want to know more about the Airflow Deployment, you can check out our official documentation.
In essence, deploying Airflow on GCP using Terraform provides cost-effectiveness, enhanced customization, versioning, and strong community support. At Astrafy, it saved us from doing manual and repetitive tasks for similar projects, so don’t hesitate to use it to make your life easier!
Be friendly and stay curious, as the world of workflow automation is vast and full of opportunities to learn and grow! Happy deploying!
Thank you
Thank you for taking the time to read this article! If you have any questions, comments, or suggestions, we would love to hear from you. If you enjoyed reading this article, stay tuned, as we regularly publish technical articles on Google Cloud and how to best secure it. Follow Astrafy on LinkedIn to be notified of the next article ;).
If you are looking for support on Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.