Control the things you can control
Foreword
The ability to iterate quickly and scale dynamically is invaluable for creating business value. This is one of the major motivations for shifting data workloads to the cloud.
With more and more companies moving to the cloud in the last ten years, there has been a shift from Capital Expenditure (CapEx) to Operational Expenditure (OpEx) in IT spending.
While in the past large servers were budgeted and set up on premises, companies are now shifting to the cloud, where a “pay-as-you-go” pricing model is the standard. The latter has the benefit of scaling on demand but generates much greater cost variance than the former. If managed properly, the cloud can lead to significant savings compared to on-premises installations. But if no governance is set up and developers are not well trained, it can lead to out-of-control bills. And as everybody knows, “if something can go wrong, it will”. This is where FinOps enters the game with a set of best practices to manage your costs on the cloud.
This article focuses on how to support your FinOps practice with Infracost. As many companies deploy their infrastructure using Terraform, Infracost is a key tool for being fully aware of the cost of new resources before they are deployed. This article also touches on usage-based costs, which are harder to tame. Before digging into those topics, we start with a brief introduction to what FinOps is.
FinOps in a few words
The Cloud FinOps book offers the following definition of the term FinOps:
The term “FinOps” typically refers to the emerging professional movement that advocates a collaborative working relationship between DevOps and Finance, resulting in an iterative data-driven management of infrastructure spending (i.e., lowering the unit economics of cloud) while simultaneously increasing the cost efficiency and, ultimately, the profitability of the cloud environment.
With FinOps, engineers need to be well acquainted with the cost structure of cloud systems. Mastering this cost structure allows engineers to deploy the right mix of resources for each use case; for instance, deploying a GKE cluster with different node pools (each one with a different machine type, and some using preemptible machines when the workloads are not for production). Cloud providers also offer fixed-rate pricing, and it is the responsibility of the engineers to know when to switch from the “pay-as-you-go” pricing model to a fixed-rate pricing model. This situation often arises with BigQuery, where the “on-demand” pricing model becomes less and less attractive once your monthly BigQuery bill reaches tens of thousands of dollars. This is only one of many examples, but this kind of reasoning has to happen for each cloud component.
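As a rough illustration of that switching decision, the sketch below computes the monthly query volume at which a fixed-rate commitment becomes cheaper than on-demand pricing. The 5 USD per TB on-demand price is the figure quoted later in this article; the fixed monthly fee of 10,000 USD and the function names are purely hypothetical placeholders, not real list prices or APIs.

```python
ON_DEMAND_USD_PER_TB = 5.0  # on-demand price quoted in this article


def break_even_tb(fixed_monthly_usd: float,
                  on_demand_usd_per_tb: float = ON_DEMAND_USD_PER_TB) -> float:
    """Monthly TB processed above which the fixed rate is cheaper."""
    if on_demand_usd_per_tb <= 0:
        raise ValueError("on-demand price must be positive")
    return fixed_monthly_usd / on_demand_usd_per_tb


def cheaper_model(monthly_tb: float, fixed_monthly_usd: float) -> str:
    """Return which pricing model costs less for a given monthly volume."""
    on_demand_cost = monthly_tb * ON_DEMAND_USD_PER_TB
    return "fixed-rate" if on_demand_cost > fixed_monthly_usd else "on-demand"


# Hypothetical fixed fee of 10,000 USD per month (an assumption for illustration).
print(break_even_tb(10_000))         # 2000.0
print(cheaper_model(500, 10_000))    # on-demand
print(cheaper_model(3_000, 10_000))  # fixed-rate
```

In practice the comparison should also account for query performance under a fixed capacity, which this arithmetic ignores.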
This article does not aim to define FinOps in detail and great resources are available either online or in books for this purpose. We recommend the following resources to become a FinOps guru:
FinOps is defined around six principles and we will see in the next section how Infracost checks those principles:
Teams need to collaborate
Everyone takes ownership for their cloud usage
A centralized team drives FinOps
Reports should be accessible and timely
Decisions are driven by business value of cloud
Take advantage of the variable cost model of the cloud
For more details on those cornerstone principles, you can refer to this link.
Last but not least, FinOps is not the responsibility of an isolated team. A company needs to set up a cross-functional team known as a Cloud Cost Center of Excellence (CCoE). This team interacts with the rest of the business to manage the cloud strategy, governance, and best practices that the rest of the organization can leverage to transform the business using the cloud. The following diagram (taken from the FinOps Foundation website, link here) describes this collaboration and shows where the FinOps team sits. The FinOps team is also the main point of contact for communication and negotiations with the major cloud providers.
Infracost
Infracost in one sentence: “Cloud cost estimates for Terraform in pull requests”.
A short video is worth a thousand words, and you will get an excellent overview of Infracost by watching this five-minute video from Infracost co-founder Hassan Khajeh-Hosseini.
At Astrafy we have implemented Infracost on each of our infrastructure repositories with the following workflow:
A caveat before diving into this workflow: you need a strict GitOps workflow enforced on each of your infrastructure repositories. Every branch linked to a Terraform Cloud workspace needs to be protected so that code can only be deployed via merge requests. This ensures that the CI pipeline depicted above always runs.
Regarding the CI pipeline, it all starts with a developer opening a merge request. This in turn triggers a GitLab merge request pipeline that starts with the following two steps:
Terraform speculative plan: this is a Terraform plan that cannot be applied and that runs automatically once you link a Terraform workspace to a VCS repository.
Infracost: this generates the cloud cost estimates for the new Terraform resources that will be deployed with the code in this merge request. The screenshot below shows how the content is displayed:
CI code for different VCS providers (GitHub, Azure Pipelines, etc.) is well documented on the Infracost website.
On top of commenting your merge requests, you should also configure your CI pipeline to send a notification to Slack with the cost details. This can be easily done and is detailed here.
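To give an idea of what such a notification step involves, here is a minimal sketch that turns an Infracost JSON report into a Slack message payload. It assumes the report exposes `currency`, `pastTotalMonthlyCost` and `totalMonthlyCost` as top-level fields with string values; check this against the output of the Infracost version you run. The function name and the merge request URL are hypothetical, and the actual HTTP call to the Slack webhook is deliberately left out.

```python
import json


def build_slack_payload(infracost_json: str, merge_request_url: str) -> dict:
    """Build a Slack incoming-webhook payload from an Infracost JSON report."""
    report = json.loads(infracost_json)
    currency = report.get("currency", "USD")
    # Assumption: totals are top-level fields, possibly null for new projects.
    past = float(report.get("pastTotalMonthlyCost") or 0)
    new = float(report.get("totalMonthlyCost") or 0)
    diff = new - past
    sign = "+" if diff >= 0 else "-"
    text = (
        f"Infracost estimate for {merge_request_url}\n"
        f"Monthly cost: {past:.2f} -> {new:.2f} {currency} "
        f"({sign}{abs(diff):.2f})"
    )
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": text}


# Hand-written sample report, not real Infracost output.
sample = json.dumps({
    "currency": "USD",
    "pastTotalMonthlyCost": "120.00",
    "totalMonthlyCost": "185.50",
})
payload = build_slack_payload(
    sample, "https://gitlab.example.com/infra/-/merge_requests/1"
)
print(payload["text"])
```

The payload would then be sent as the JSON body of a POST request to the webhook URL configured for your FinOps channel.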
The next step in your pipeline is to run your Terraform plan and Infracost output against the OPA policies you have defined. You should define two types of policies:
Generic infrastructure policies: these are policies defined by your governance team that serve as guardrails to comply with the global cloud governance. For instance, a policy can enforce that any VM deployment on Google Cloud must be in region “europe-west1” or that only certain machine types are allowed.
Cost policies: these enable DevOps and FinOps teams to take action on cloud costs. Following the example on the Infracost website, a policy could be that the total monthly cost of the entire infrastructure should be less than 500 USD.
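Real OPA policies are written in Rego and evaluated against the plan and Infracost JSON outputs. The sketch below only illustrates the logic of the two policy types in Python, using the examples from the text; the `PlannedVm` structure, the machine type allow-list and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlannedVm:
    """Simplified view of a VM found in a Terraform plan (hypothetical)."""
    name: str
    region: str
    machine_type: str


ALLOWED_REGIONS = {"europe-west1"}
ALLOWED_MACHINE_TYPES = {"e2-small", "e2-medium"}  # hypothetical allow-list
MAX_MONTHLY_COST_USD = 500.0


def evaluate_policies(vms: list[PlannedVm], total_monthly_cost: float) -> list[str]:
    """Return the list of policy violations; an empty list means 'pass'."""
    violations = []
    # Generic infrastructure policies: region and machine type guardrails.
    for vm in vms:
        if vm.region not in ALLOWED_REGIONS:
            violations.append(f"{vm.name}: region {vm.region} is not allowed")
        if vm.machine_type not in ALLOWED_MACHINE_TYPES:
            violations.append(
                f"{vm.name}: machine type {vm.machine_type} is not allowed"
            )
    # Cost policy: total monthly cost must stay below the budget.
    if total_monthly_cost >= MAX_MONTHLY_COST_USD:
        violations.append(
            f"total monthly cost {total_monthly_cost:.2f} USD is not below "
            f"{MAX_MONTHLY_COST_USD:.2f} USD"
        )
    return violations


vms = [
    PlannedVm("api", "europe-west1", "e2-small"),
    PlannedVm("batch", "us-central1", "e2-medium"),
]
for violation in evaluate_policies(vms, total_monthly_cost=620.0):
    print(violation)
```

With these inputs, two violations are reported: one for the region of the `batch` VM and one for the total cost.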
The outputs of all those CI steps are available in the following places (and are also conveyed via Slack messages):
Terraform Cloud UI: developers can review the Terraform speculative plan directly in the UI.
Merge request comments: Infracost outputs with cost details, and the OPA policies that passed or failed, are displayed as comments in the merge request.
This allows your developers and the different FinOps stakeholders (the ones mentioned at the end of the previous section) to be aware of those cost changes and discuss them together. Infracost enables the FinOps team and its stakeholders to be proactive about cloud cost changes instead of being reactive and firefighting unexpected ones.
Let us have a look at the different principles of FinOps and how they are checked by using Infracost within a robust CI pipeline:
Usage-based costs
Cloud costs can be split into two categories:
Static costs: those costs are essentially fixed and easy to predict, as they reflect constant usage of a specific resource. For instance, when you deploy a VM on Google Cloud, the price will be X USD per month and will not vary. The same goes for the cost of reserving a static IP address, deploying a Cloud SQL instance, etc. Infracost estimates those costs precisely by fetching the prices published by the different cloud providers.
Usage-based costs: those costs depend on your usage of the deployed resource. They can often represent the biggest part of your bill due to the “pay-as-you-go” pricing model of most cloud resources.
Usage-based costs are the ones that can send your bill sky-high if not managed properly. Good examples of usage-based costs are Google Cloud Functions, Google Cloud Storage and BigQuery. With the “on-demand” pricing model of BigQuery, you pay 5 USD per TB of data processed. Once you have a few engineers running queries on a daily basis, you will quickly see single queries costing dozens of dollars for a multitude of good or bad reasons (“SELECT *”, no filters applied, etc.). The difficulty for the FinOps team is to estimate those variable costs. Different solutions exist, and they need to be combined.
One of those solutions is Infracost, which can be used to estimate those costs. For usage-based resources, you can provide a usage file in which the approximate usage of each resource is defined. Each time a new usage-based resource is deployed, it is the responsibility of the developer to provide estimates. Those estimates then serve as a baseline that can be adapted according to real consumption. You can find all the details on how to implement this in Infracost on this page.
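The idea of turning usage assumptions into a cost baseline can be sketched as follows. This does not reproduce the Infracost usage file format, which is documented on the page mentioned above; the resource names, metric names and the storage unit price are hypothetical, and only the 5 USD per TB BigQuery price comes from this article.

```python
# Unit prices in USD per unit per month.
UNIT_PRICES_USD = {
    "bigquery_queries_tb": 5.0,  # on-demand price quoted in this article
    "storage_gb": 0.02,          # hypothetical placeholder price
}


def estimate_baseline(usage: dict[str, dict[str, float]]) -> dict[str, float]:
    """Estimate the monthly cost per resource from approximate usage figures."""
    baseline = {}
    for resource, metrics in usage.items():
        cost = sum(
            quantity * UNIT_PRICES_USD[metric]
            for metric, quantity in metrics.items()
        )
        baseline[resource] = round(cost, 2)
    return baseline


# Estimates provided by the developer when the resources are first deployed.
usage_estimates = {
    "analytics_dataset": {"bigquery_queries_tb": 40},
    "raw_data_bucket": {"storage_gb": 1500},
}
baseline = estimate_baseline(usage_estimates)
print(baseline)
print(sum(baseline.values()))
```

Once real consumption figures are available, the same function can be rerun with the observed quantities to update the baseline.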
While Infracost will help you set a baseline based on your data team assumptions, it is of major importance to complement those assumptions with monitoring and quotas.
Monitoring: the FinOps team needs to develop near real-time dashboards for all usage-based resources deployed. Having those dashboards in place allows the FinOps team to stay in control of those variable costs and to notify the relevant teams when costs deviate by more than a certain threshold. ChatOps via Slack notifications should be set up on the different metrics displayed on those dashboards. Once an alert is raised, it is dispatched to the FinOps Slack channel and to a specific Slack channel that includes the developers responsible for this cost.
Quotas: setting up quotas prevents a specific resource from consuming more than a certain threshold. Quotas are upper limits that should not be reached under normal circumstances and act as a guardrail against abnormal consumption of a resource. BigQuery is a good example where quotas should be defined with care; you should for instance set up quotas for the maximum number of TB per day and the maximum number of TB per user per day. This ensures that the blast radius of inefficient queries is capped to a certain budget per day.
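To make the "budget per day" in the quota item concrete, here is a small sketch of the worst-case daily spend implied by the two BigQuery quotas, using the 5 USD per TB on-demand price quoted earlier. The quota values, the number of users and the function name are hypothetical.

```python
ON_DEMAND_USD_PER_TB = 5.0  # on-demand price quoted in this article


def max_daily_spend(project_quota_tb: float,
                    user_quota_tb: float,
                    active_users: int) -> float:
    """Upper bound on daily on-demand query spend when both quotas apply."""
    # Users cannot collectively exceed the project-level quota,
    # and each user is individually capped by the per-user quota.
    effective_tb = min(project_quota_tb, user_quota_tb * active_users)
    return effective_tb * ON_DEMAND_USD_PER_TB


# Hypothetical quotas: 20 TB per day for the project, 2 TB per user per day.
print(max_daily_spend(20, 2, active_users=15))  # 100.0
print(max_daily_spend(20, 2, active_users=5))   # 50.0
```

Whatever the number of inefficient queries, the daily bill for on-demand queries cannot exceed this bound as long as the quotas are enforced.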
Every time a quota is reached, you need to deep dive into the monitoring dashboard of that resource, discuss with the team responsible for reaching that quota and analyse the root cause. If it is legitimate, then the quota should be increased; if not, quotas did a great job and the root cause analysis with monitoring dashboards will help to write a detailed post-mortem report of this ‘quota reached’ incident.
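The threshold-based alert routing described in the monitoring item above could look like the following sketch. The channel names, the ownership mapping and the 25% threshold are hypothetical, and sending the actual Slack message is left out.

```python
FINOPS_CHANNEL = "#finops-alerts"  # hypothetical channel name

# Hypothetical mapping from resource to the owning team's channel.
OWNER_CHANNELS = {"analytics_dataset": "#team-analytics"}


def channels_to_notify(resource: str, baseline_usd: float,
                       actual_usd: float, threshold_pct: float) -> list[str]:
    """Return the Slack channels to alert, or an empty list if within budget."""
    if baseline_usd <= 0:
        raise ValueError("baseline must be positive")
    deviation_pct = (actual_usd - baseline_usd) / baseline_usd * 100
    if deviation_pct <= threshold_pct:
        return []
    channels = [FINOPS_CHANNEL]
    owner = OWNER_CHANNELS.get(resource)
    # Also alert the developers responsible for this cost, when known.
    if owner and owner not in channels:
        channels.append(owner)
    return channels


print(channels_to_notify("analytics_dataset", 200.0, 290.0, threshold_pct=25))
print(channels_to_notify("analytics_dataset", 200.0, 220.0, threshold_pct=25))
```

The first call exceeds the baseline by 45% and alerts both channels; the second stays within the threshold and returns an empty list.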
Conclusion
From the FinOps Foundation website:
If it seems that FinOps is about saving money, then think again. FinOps is about making money.
Cloud spend can drive more revenue, signal customer base growth, enable more product and feature release velocity, or even help shut down a data center. FinOps is all about removing blockers; empowering engineering teams to deliver better features, apps, and migrations faster; and enabling a cross-functional conversation about where to invest and when. Sometimes a business will decide to tighten the belt; sometimes it’ll decide to invest more. But now teams know why they are making those decisions.
In this article we have focused on how a specific tool can help the FinOps team gain visibility into ongoing infrastructure changes. Chances are your company is using Terraform to deploy its cloud infrastructure, and having a tool that informs you about the cost generated by each bit of code is essential. Infracost does just that, and it does so with neat integrations for the most popular CI tools.
If you are looking for support on Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.