What is Cloud Hygiene and Why is it Important?
Hi Friends,
I wrote this over the weekend, as I’m not a great party or cocktail person 🤪, so I thought it would be valuable for all of us if I spent some time sharing my experience on why it is so important to have the right cloud hygiene from the very beginning of your operations 💚, and what the risks of not doing so are 🧱. This should be interesting for any pre-Series B tech company that uses one of the most common cloud infrastructure providers (AWS/GCP/Azure/Digital Ocean/Oracle Cloud).
Let’s start with the risks
It’s been 15 years since I began consulting for growth-stage companies facing cloud horror and needing help with taking care of and cleaning up their cloud infrastructure. To be honest, I often felt like a doctor working with a patient who has many illnesses due to neglecting their health. I have seen many cases where companies reach this stage and hire contractors who end up burning money and wasting time. And again… I can’t stress enough how important it is to keep customers and the business in mind rather than dealing with technology bottlenecks. So I have put together some company failure stories that I dealt with, hoping to prevent others from falling into the same trap. The cases are real, but for the sake of privacy, I will refer to them as Company1, Company2, …, Company{N}.
Case 1
Company1 missed out on a significant opportunity to secure a $500k annual contract with a Fortune 100 company because they couldn’t complete SOC2 on time and a competitor won the deal. As a result, some investors pulled back their term sheets and the company went from growth mode to survival mode.
Here is the long story:
Situation: When a friend referred them to me, they had 30 current and former engineers who had created 200+ resources across their cloud accounts: EC2 instances, ECS and EKS clusters, S3 buckets, and EBS disks hanging here and there, burning money. The problem started when they had to pass a SOC2 audit because a Fortune 100 company made it a requirement. So now or never 🙂 They tried to hire an in-house SRE to do the work, but it turned out to be too hard to find the right talent, and it would have taken months to recruit someone from another company.
Solution: So the only way was to find well-experienced SRE contractors from service provider companies and have them report to a fellow advisor on executing the roadmap. What did that involve? Eventually they had to migrate their whole infrastructure from the AWS management account to sub-accounts, configure encryption, SSO, and 2FA access to the clouds, coordinate the roles of the resources with the team, change the code to support the encryption mechanisms, and much more. Sure, I helped them along the way as a consultant, but it took 7 months to get it done.
Outcome: They lost the contract! So the takeaway for me was that you had better get a clean setup right from the very beginning when it comes to your legal and infrastructure foundations, otherwise it can bring the whole business to its knees.
Case 2
Company2 was unexpectedly charged $40k by AWS while they thought they still had $100k in credits, valid for more than a year. It turned out that the credits had run out and AWS’s calculations were accurate. And there was no obvious low-hanging fruit to optimize costs.
Here is the long story:
Situation: When they came to us, the CEO was literally yelling that their card had been charged $40k while they thought they still had credits. And the horror was that AWS’s calculations turned out to be accurate: they really had been burning that much money. A brief glance at their cloud cost reports showed that their infrastructure was constantly growing and burning more and more money every month. Lambda functions which had been taking 5 seconds a year ago were now taking 10+ minutes, and nobody knew the cause. Databases which were 100GB a year ago were now 6TB in size, Kubernetes was crazily autoscaling to 30 nodes every night because a mystery benchmark-testing job by a former employee had gone out of control at some point, traffic costs were $12k a month and nobody knew where the traffic was flowing (eventually it turned out to be a node connecting to their database and pulling data with an extremely non-optimal query), etc… And here again… nobody had the time ⏲️ to slow down development and troubleshoot all of this. The company also couldn’t afford a $40k payment every month. The CEO was yelling “Let’s move to GCP! Let’s move to Azure! We have credits there!”. But, you know… credits won’t migrate your infrastructure, right? Especially when you have legacy resources running legacy code. 🙂
Solution: They spent 3 months hiring a team of expensive cloud-certified SREs to deal with their infrastructure, code, cloud, and security. Eventually they were able to bring the infra to a relatively acceptable condition, but the cost in time and money was huge.
Outcome: The time and effort lost on fixing this put the business out of focus for 3 months. That ultimately led to losing an important customer due to their failure to implement the requested improvements on time. So the takeaway for me was that infrastructure mistakes can have far-reaching consequences beyond just financial costs: they also disrupt executives and developers and divert attention at critical moments when they should be concentrating on the product.
So what can you do to avoid this in the future?
There is a lot of advice and many best practices out there, but here are some very important ones that can easily be taken care of now.
- Never host any resources in your AWS management (root) account; always split your AWS organization into dedicated accounts such as Management, Dev, Prod, Sandbox, etc. (see the first sketch after this list).
- Configure secure access. Use SSO for providing access to your cloud accounts, and avoid username/password authentication.
- Don’t rely on credits, that’s a trap! Credits will run out one day, and it’s better to be ready for it. So you really want to monitor your costs right away with easy-to-set-up tooling. Cloudchipr can help here, and I can personally help you set up the right guardrails for free. A simple budget alert is a good start (see the budget sketch after this list).
- Automatically enforce tagging best practices. Here is a Tagging Best Practices doc by AWS which describes some of the approaches. You should have something similar in your company wiki, and again you should use an easy-to-integrate tool which will at least drop you a Slack message when someone creates a resource without the mandatory tags (the tagging sketch after this list shows the idea).
- Make sure that every resource is tagged with its owner. If an employee leaves the company, make sure someone takes over all of their cloud resources. Have an automation workflow which notifies you when there are resources left behind by former employees.
- Don’t use spot instances unless you completely don’t care if the resource goes down at any moment. Keep in mind that spot instances can be time-consuming to manage, so during your growth stage you’d better save time for your engineers rather than money on the cloud. It’s worth using spot only when you really lose nothing if the resource goes down at any second; otherwise it’s better to use a right-sized on-demand resource.
- Go serverless (Kubernetes) as much as possible. Sometimes engineers argue that Kubernetes is overkill, however it simplifies a lot and keeps your stack pretty standard. A standard stack has a lot of cost and time benefits. For example, it’s easier to find talent with the right expertise when you have a standard stack in your job posting requirements.
- Be multi-cloud and have a cloud-agnostic architecture. Don’t bind your application architecture to any single cloud. This way you’ll always have more flexibility when it comes time to scale and grow.
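To make the account-separation point a bit more concrete, here is a minimal sketch in Python with boto3 (the account names and email addresses are just placeholders, not a recommendation) of how member accounts could be requested through AWS Organizations instead of piling workloads into the management account:

```python
# Minimal sketch: request separate member accounts (Dev, Prod, Sandbox)
# under an existing AWS Organization instead of hosting workloads in the
# management account. Assumes boto3 is installed and the caller has
# organizations:CreateAccount permissions in the management account.
import boto3

orgs = boto3.client("organizations")

# Placeholder account names and email addresses -- adjust to your company.
ACCOUNTS = {
    "dev": "aws-dev@example.com",
    "prod": "aws-prod@example.com",
    "sandbox": "aws-sandbox@example.com",
}

for name, email in ACCOUNTS.items():
    # create_account is asynchronous; it returns a status object that can be
    # polled with describe_create_account_status until the account is ready.
    response = orgs.create_account(Email=email, AccountName=name)
    status = response["CreateAccountStatus"]
    print(f"Requested account '{name}': state={status['State']}, id={status['Id']}")
```

In a real setup you would poll describe_create_account_status and then move each new account into the right organizational unit, but the point stands: workloads live in member accounts, not in the management account.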
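For the credits trap, the simplest guardrail I know is a monthly AWS Budget with an alert well below the limit, so a burned-through credit balance never comes as a surprise. Here is a minimal sketch, again in Python with boto3; the budget name, the $10k limit, and the email address are placeholders you’d replace with your own:

```python
# Minimal sketch: a monthly cost budget with an email alert at 80% of the limit.
# Assumes boto3 and budgets:CreateBudget permissions; all values are placeholders.
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "monthly-cloud-spend",                # placeholder name
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},  # placeholder limit
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,                          # alert at 80% of the limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}  # placeholder
            ],
        }
    ],
)
print("Budget created: an email alert fires at 80% of the monthly limit.")
```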
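And for the tagging and ownership points, here is a minimal sketch of the kind of automation I mean: it walks through resources via the Resource Groups Tagging API, flags anything missing an owner tag, and drops a message into Slack. The owner tag key and the webhook URL are my assumptions, not a standard:

```python
# Minimal sketch: find resources that are missing a mandatory "owner" tag
# and report them to Slack. The tag key and the webhook URL are assumptions;
# adapt them to your own tagging policy. Assumes boto3 and tag:GetResources
# permissions.
import json
import urllib.request

import boto3

MANDATORY_TAG = "owner"                                              # assumed tag key
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook

tagging = boto3.client("resourcegroupstaggingapi")
missing_owner = []

# Page through resources in the region and collect the ones without the tag.
for page in tagging.get_paginator("get_resources").paginate():
    for resource in page["ResourceTagMappingList"]:
        tag_keys = {t["Key"] for t in resource["Tags"]}
        if MANDATORY_TAG not in tag_keys:
            missing_owner.append(resource["ResourceARN"])

if missing_owner:
    message = {
        "text": f"{len(missing_owner)} resources are missing an '{MANDATORY_TAG}' tag:\n"
        + "\n".join(missing_owner[:20])
    }
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

One caveat: GetResources only sees resources that are or were tagged at some point, so a dedicated tool or AWS Config rules are still needed to catch completely untagged resources, and the same scan can be pointed at an owner tag belonging to a former employee.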