Many of our customers have asked us to shift focus from growth to operational cost optimisation. That is what prompted us to write about the most common pitfalls and mistakes in AWS spending, and how you can wow your board of directors with practical actions.
Proactive vs Reactive. All our customers are focused on delivery, i.e. getting deployments through all environments (dev, test, prod). Security, governance and testing are the stated priorities, yet project/product delivery concentrates on removing every blocker to ship the MVP, and Performance and Cost optimisation consistently take the back seat in favour of over-provisioning capacity and relying on the cloud's Auto Scaling. We recommend our customers adopt a slightly different approach:
- Don’t launch/release a new App/API without monitoring; in fact, don’t run an App/API without monitoring at all.
- If you overprovision for whatever reason (no previous baseline to compare against, no real understanding of traffic/usage, etc.), set a date to revisit and readjust (two weeks after the initial launch is ideal).
- Include your DevOps engineers, Cloud Engineers and SREs (or whatever name/title you choose) in sprint planning or planning sessions so they understand what you are building. No matter how full-stack or independent you think your team is, collaboration is the key to success.
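The two-week readjustment in the second bullet can be sketched as a simple right-sizing check. This is a hedged illustration, not an AWS API: the function name, the 60% CPU target and the metrics you feed it are assumptions; in practice you would pull two weeks of utilisation data from CloudWatch.

```python
# Hedged sketch: suggest a new fleet size from observed peak utilisation.
# Thresholds are illustrative assumptions, not AWS recommendations.
from math import ceil

def suggest_capacity(current_count: int, peak_cpu_pct: float,
                     target_cpu_pct: float = 60.0, min_count: int = 2) -> int:
    """Resize the fleet so the observed peak CPU would sit near the target."""
    needed = ceil(current_count * peak_cpu_pct / target_cpu_pct)
    return max(min_count, needed)  # keep a floor for availability

# e.g. 10 instances peaking at 18% CPU are heavily overprovisioned:
print(suggest_capacity(10, 18.0))  # -> 3
```

Running a check like this against real baselines two weeks after launch turns "revisit and readjust" into a five-minute ticket instead of a debate.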
About Manually Created Platforms/Resources. Let me start by saying I understand: AWS web dashboards today are very complete and genuinely easy to use; in fact, it takes longer to create a platform with code than with the web console. The hidden trap is that your environments will not be consistent, and that will bite you back, guaranteed: failed deployments, different versions and different configurations that will most likely cause errors and failures in the final deployment phase. And when you are cutting costs by deleting resources, you can’t switch off the whole platform without dreaded, unwanted downtime, because you have no way to quickly and repeatably bring it back online. One good example: with one of our enterprise customers we were quick to delete costly resources identified as unnecessary in one part of the platform, but another division was stuck chasing different areas of the business to confirm whether they could switch off at all. What we recommend:
- Identify the platform patterns and create templates developers can easily use and deploy with Infrastructure-as-Code and scripts where necessary; automate! This has the added bonus of speeding up delivery.
- Standardise the technology stack intelligently: this will cut costs, improve collaboration and the stability of your platforms across your Dev teams, and save considerably on training and hiring.
- Educate: make sure everyone understands why manual setups are a bad quick fix, short or long term, and only allow them in a localised and isolated Development environment during Proof-of-Concept and exploration phases.
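As a minimal illustration of the template idea, here is a hedged sketch that renders a parameterised, CloudFormation-style fragment so every environment is created from the same definition. The resource names and shape are hypothetical and heavily simplified (a real template also needs a launch template, subnets, etc.); in practice you would use CloudFormation, Terraform or CDK modules.

```python
import json

# Hedged sketch: one reusable definition for dev/test/prod, so environments
# stay consistent and can be torn down and recreated on demand.
def render_asg_template(env: str, min_size: int, max_size: int) -> str:
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Standard stateless ASG pattern for {env}",
        "Resources": {
            "AppAsg": {  # hypothetical logical ID for illustration
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "Tags": [{"Key": "Environment", "Value": env,
                              "PropagateAtLaunch": True}],
                },
            }
        },
    }
    return json.dumps(template, indent=2)

print(render_asg_template("dev", 2, 4))
```

The point is not this particular snippet but the discipline: parameters in, identical environments out, and deleting or restoring a platform becomes a one-command operation.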
Compute and networking optimisation. At the core of every platform you build on AWS, overprovisioning of capacity is happening, most likely on Compute but also on Networking; it is a psychological effect of the “consume first and we send you the bill later” model. One tip for networking: it’s crucial you ask yourself “do I need to isolate my apps at the VPC level, or will Security Groups be enough?” In our experience, filtering access at the Security Group level covers 98% of use cases, avoids duplicating platforms just to keep them in different network segments (i.e. VPCs), and avoids the complexity and cost of VPC Peering and Transit Gateways. For Lambda, concurrency, execution times and real memory usage should all be monitored; long execution times, for example, could signal a refactor and massive cost savings. Some of the best practices:
- For EC2, use Reserved Instances for the baseline and Spot or On-Demand pricing for Auto Scaling; this implies you build all your EC2 instances as stateless and under Auto Scaling Groups.
- Everything must be monitored. Make sure you enable at least CloudWatch detailed monitoring on all EC2 instances (with the CloudWatch Agent for memory metrics) so you monitor CPU/memory/network and understand patterns and baselines, e.g. spotting an abnormal memory spike after a new release compared with the previous weeks’ or months’ baseline.
- This is going to sound strange, but be creative: Glue jobs are expensive, so why not run Spark jobs on your EKS cluster? Can your apps be refactored to become event-driven (serverless)? Cost reduction can be a great motivator and challenge for developers, and it will definitely lift the morale of your best ones.
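To make the Lambda point above concrete, here is a hedged back-of-the-envelope cost model. The per-GB-second and per-request prices are illustrative approximations of the public rates, and the workload numbers are invented; always check the current AWS price list before drawing conclusions.

```python
# Hedged sketch: estimate monthly Lambda cost and the saving from a refactor
# that halves execution time. Prices are illustrative, not a quote.
GB_SECOND_PRICE = 0.0000166667    # $ per GB-second (approximate)
REQUEST_PRICE = 0.20 / 1_000_000  # $ per request (approximate)

def lambda_monthly_cost(invocations: int, avg_duration_s: float,
                        memory_mb: int) -> float:
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE

# Hypothetical workload: 50M invocations/month at 512 MB.
before = lambda_monthly_cost(50_000_000, 1.2, 512)  # slow handler
after = lambda_monthly_cost(50_000_000, 0.6, 512)   # refactored, 2x faster
print(f"before=${before:,.2f} after=${after:,.2f} saved=${before - after:,.2f}")
```

Because Lambda bills duration times memory, halving execution time roughly halves the compute portion of the bill, which is exactly why long execution times are worth a refactoring conversation.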
Monitoring, Logging and Alerting. One of the most overlooked areas, maybe because traditionally it is built (or not…) and only looked at (or its need realised…) when something isn’t working or has broken down, not to mention a common lack of understanding of what is relevant to monitor and alert on. Any new application must have this as part of the application platform itself, and the more complex and decoupled your application platforms are, the simpler it should be; as Picasso said, “…simplifying isn’t simple…”. This is not the time to scale down on monitoring: maybe rethink and renegotiate SaaS/tool contracts and support, but not the quality and insights. Without these, you can’t measure the impact of changes or accurately reduce your spending without adversely affecting your platform and, as a consequence, your business. Some suggestions:
- CloudWatch offers plenty of tools for capture and alerting within AWS, and many more to give you insight from the data collected.
- Make monitoring at all levels standard by installing and enabling the CloudWatch Agent on all EC2 instances.
- Datadog, Prometheus (open source, but there are maintenance and support costs to consider) and managed ELK solutions are also available.
- Make Monitoring, Logging and Alerting part of your Infrastructure-as-Code templates.
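The "compare against previous weeks' baselines" idea can be sketched in a few lines. This is a hedged illustration with invented numbers and an arbitrary 3-sigma threshold; in practice you would feed in CloudWatch datapoints, and CloudWatch's built-in anomaly detection may serve you better.

```python
from statistics import mean, stdev

# Hedged sketch: flag a post-release memory reading that sits well above
# the baseline built from previous weeks. Threshold is illustrative.
def is_abnormal(baseline_pct: list, current_pct: float,
                n_sigmas: float = 3.0) -> bool:
    mu, sigma = mean(baseline_pct), stdev(baseline_pct)
    # floor sigma at 1% so a near-flat baseline doesn't trigger on noise
    return current_pct > mu + n_sigmas * max(sigma, 1.0)

weeks = [41.0, 43.5, 42.2, 40.8, 44.1]  # weekly avg memory %, pre-release
print(is_abnormal(weeks, 71.0))  # spike after a release -> True
print(is_abnormal(weeks, 45.0))  # within normal variation -> False
```

Wiring a check like this to a release pipeline turns the vague instinct of "memory looks high since Tuesday" into an alert your team sees the same day.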
Non-production environments $$$ cut. One of the advantages of cloud technology is capacity on demand at any time: you can schedule when you want to use capacity and only be charged for the seconds, minutes, hours, requests, bandwidth, etc. consumed. A practical suggestion:
- Switch off non-production resources that aren’t in use, with schedules:
Example for EC2 Auto Scaling Groups, Linux, Ireland region, no-upfront instances, online 9 am to 7 pm on weekdays:
- 100 x m5.xlarge EC2 instances at a standard monthly usage of 720 hours (30 days) will cost you $10,950.00 p/m.
- 212 hours of idle time on weekends.
- 308 hours (22 days x 14 hours per day) of weekday off-peak hours, when your systems are likely idle.
- Roughly 72% of the bill is available as savings ($7,884.00 wasted p/month, $94,608.00 p/year), not including costs for storage, etc.
- Because the environments are recreated every day, you also ensure the latest updates and fixes are applied.
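The arithmetic above can be reproduced in a few lines, using the example's own figures. Note the article rounds to a flat 72%; computing directly from the listed off-hours gives about 72.2% and ~$7,908 rather than $7,884, a difference of rounding only.

```python
# Hedged sketch reproducing the scheduling example: 100 m5.xlarge instances,
# online only 9am-7pm on weekdays. All inputs come from the example above.
monthly_cost = 10_950.00   # 100 instances, 720 hours, always on
hours_per_month = 720
off_hours = 212 + 308      # weekend hours + weekday off-peak hours

saved = monthly_cost * off_hours / hours_per_month
print(f"off for {off_hours}h -> ~{off_hours / hours_per_month:.0%} saved, "
      f"${saved:,.2f} p/m, ${saved * 12:,.2f} p/y")
```

Ten lines of arithmetic like this, presented with your own instance counts, is usually all the business case a scheduling initiative needs.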
Other options include:
- EC2 Hibernation.
- RDS now also supports stopping instances, and you can use the maintenance window to start them again.
- Refactor applications that can become event-driven (i.e. serverless) and pay only for the resources actually used.
Storage: take a clear-eyed approach to how data is accessed and which databases to use. Storage tends to grow consistently, be it files of all types or database storage (some databases are now on-demand/serverless), and today’s choice of storage solutions lets you combine application fit with cost-effectiveness. Each solution depends very much on your platform design and architecture, but I want to leave a quick example of using AWS S3’s advantages for backup:
- Choose from different storage price tiers; plus, the cost per GB drops as overall usage increases.
- Your data will be more durable than on EBS and available to POST/GET from your applications (you would have to design your App/API architecture accordingly).
- Applications can now use S3 as a data repository with Data Catalog solutions, some AWS-native and some free.
- RDS and DynamoDB backups can be stored in and restored from S3.
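As a hedged illustration of the tiering point, compare the monthly cost of the same backup data across storage classes. The per-GB prices below are illustrative ballpark figures, not current quotes; check the S3 pricing page for your region before deciding.

```python
# Hedged sketch: monthly storage cost for the same backups in each class.
# Prices are illustrative ballpark $/GB-month figures, not a quote.
PRICE_PER_GB = {
    "S3 Standard": 0.023,
    "S3 Standard-IA": 0.0125,
    "S3 Glacier Flexible Retrieval": 0.0036,
}

def monthly_storage_cost(gb: float) -> dict:
    return {tier: round(gb * price, 2) for tier, price in PRICE_PER_GB.items()}

# e.g. 5 TB of backups:
for tier, cost in monthly_storage_cost(5 * 1024).items():
    print(f"{tier}: ${cost:,.2f} p/m")
```

Pair a comparison like this with an S3 lifecycle rule that transitions backups to colder tiers automatically, and the savings recur without anyone touching them again.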
Security and Governance. Cutting in this area more often than not proves to be a costly mistake many organisations seem to make: exploits and hackers (amateurs, pros, or the unwitting) are still out there, looking for ways to crack your hard-built business for fun, profit or plain ignorance. Take advantage of the many tools AWS already provides for monitoring, advising and alerting. Some recommendations:
- Use tools such as AWS Security Hub with AWS Config, GuardDuty and Trusted Advisor in all environments.
- Instruct developers to check whether new applications have introduced security or compliance failures, and to fix them from the lower (Dev) all the way to the higher (Prod) environments.
- Schedule maintenance windows at regular intervals, or automate them completely through IaC; if you are using cloud-native or serverless services this should be straightforward.
- Set standards using Infrastructure-as-Code templates everyone can use, with security and compliance embedded as standard principles.
- Document any accepted False Positives.
- Make security the responsibility of whoever developed it, and make developers responsible for deployments all the way to Production, with the necessary tools and processes (CI/CD) in place so they don’t have to reinvent the wheel every time.
Bottom line: there are many ways to cut costs in your AWS environments; just don’t tighten the belt so much that you can’t breathe.
Cutting costs in the right places could fund the two extra developers or suppliers you really need, while still presenting reductions in your overall AWS investment alongside an increase in transactions and revenue; there is room to continue growing.
The secret is to make cost optimisation part of your sprint planning and ongoing Continuous Improvement strategies, with the right optics (ticketing, monitoring, logging and alerting) available to your teams, and to make sure they take ownership from day zero all the way to Production.
Note: This article gives the views of the author, and not the position of LinkedIn, nor of Boldlink-SIG and its associates.