me at vietnam web summit 2018

This is the story behind my talk: “Cloud Cost Optimization at Scale: How we use Kubernetes and spot instances to reduce EC2 billing up to 80%”.

Now before I tell this story, I will admit first hand that the actual number is lower than 80%.

2015

The story began in mid 2015 when I was employed by one of my ex-employer. It was a .NET framework shop that struggled to scale in both performance and cost at the time. I was hired as developer to work on the API integration but I can’t help to notice too much money was sunk into AWS EC2 billing. Bear in mind I’m not an Ops guy by any mean but you know about startup, one usually has to wear many hats.

At first, when the AWS credit is still plenty, we don’t have to worry much about it. But when it ran low, it’s clearly becoming one of the biggest pain point of our startup.

The situation at the time is like this:

  • There were 2 teams: the core team using .NET framework and the API team using Node.js
  • Core team mostly uses Windows-based instance and API team uses Linux-based.
  • The core team uses a lot more instances than API team.
  • Most EC2 instances are Windows-based. All are on-demand instances. No reserved instances whatsoever 😨.
  • Few are Linux-based instances where we install other linux based applications but there weren’t many of them.
  • On-demand Windows-based instance price is about 30% higher than Linux-based.
  • We use RDS for database.
  • We don’t have any real ops guy as you think these days. Whenever we need something setup, we just have to page someone from India team to create instances for us and then proceed to set them up ourselves.

Now, the biggest sunk cost are obviously RDS and EC2. If I were to assigned to optimize this, I will definitely take a look at those 2 first. But I wasn’t working on it at that time. I was hired to do other things.

At that time, I used Deis - a container management solution (acquired by Microsoft later) for my projects. I experimented shortly with Flynn but ended up not using it.

2016

Spotinst

In 2016, I heard of this startup called Spotinst. I found several useful posts from their blog regarding EC2 cost optimization and find their whole startup ideas very fascinating. For those of you who are not working with infrastructure, the whole idea of Spotinst is to use spot instances to reduce the infrastructure cost for you. And they take some cut from it.

Spotinst automates cloud infrastructure to improve performance, reduce complexity and optimize costs.

Spot instances are very cheap (think 70-90% cheaper vs on-demand) EC2 offering from AWS but comes with a small problem: it can goes away anytime with just 2 minutes notice.

I thought if we can design our workload to be fault tolerant and gracefully shutdown, spot instances will make perfect sense. Or anything like a queue and worker workload would fit as well. Web apps, on another hand, will be a little bit more difficult but totally do-able.

Kubernetes

During 2016, I also learnt about this super duper cool project called Kubernetes. I believe they were at version 1.2 at the time.

Kubernetes comes with the promise of many awesome features but what caught my eyes were this “self-healing” feature. This make perfect complement with spot instances, I thought.

And so I dig a little bit more to see if I can set one up with spot instances and they do support it. Awesome!! 🥰

Now, the only problem left is our core team still need Windows and Kubernetes didn’t support Windows at the time. So my whole infrastructure revamp idea is useless now, or so I thought.

.NET Core

In mid 2016, I learnt about .NET core project. They were around 1.0 release at the time. One of the feature is cross-platform. I thought to myself: I can still salvage this.

Now, please note that I’m a Node.js guy and I don’t know much about .NET aside from my thesis in university. So I asked the lead guy from core team to take a look into it and while there are many quirks, it’s actually not very difficult to migrate our core to .NET Core. It would be time consuming but it’s very much doable. I know that .NET Core is going to be the future so eventually, we will need to migrate to it anyway.

Tests + Migration

While the core team do that, I setup a test cluster with spot instances and learnt Kubernetes. I optimized the cluster setup a little bit and migrate all my projects over to them by the end of 2016. The whole process is quite fast because all the apps I have (Node.js) are already Dockerized and have graceful shutdown implemented. I just need to learn the in-and-out of Kubernetes.

I started with a managed GKE at first using their free $300 credit, to learn the basic of Kubernetes; but then later on use kops to setup a production cluster with AWS.

Some of the changes I did for the production cluster is:

  • Setup instance termination daemon to notify all the containers + graceful shutdown for all the apps.
  • Setup multiple instance groups of various size and availablity zone, mixing spot instances with reserved instances. This is to prevent price spike of certain spot instance group; and minimize the chances of all spot instances going down at the same time.
  • Calculate and provision a slightly bigger fleet then what we actually need so that when there were instances shut off, there won’t be service downgrading. Because spot instances are so cheap, we can do this without worry much about the cost.
  • Watch to see if there were scheduling failture to scale the reserved groups.

2017

At this point, our API apps’ EC2 cost is already very managable. We’re waiting for the core team to migrate over. And we did that in 2017. The overall cost saving for EC2 was around 60-70% because we need to mix reserved instances in and provision a little higher than what we actually need. We were very happy with the result.

What we did back then is actually what Spotinst does but at much smaller scale. And it’s more doable with smaller startups with only 1 ops guy.

And that is my story behind the talk: “Cloud Cost Optimization at Scale: How we use Kubernetes and spot instances to reduce EC2 billing up to 80%”.

Update: #1 on HackerNews. Yay!