Tips From the Trenches: Cloud Custodian–Automating AWS Security, Cost & Compliance

“We’re moving to the cloud.” If you haven’t heard this already, it’s likely you will soon. Moving to the public cloud poses many challenges upfront for businesses today. Primary problems that come to the forefront are security, cost and compliance. Where do businesses even start? How many tools do they need to purchase to fulfill these needs?

After deciding to jump-start our own cloud journey, we spun up our first AWS account, and it was immediately apparent that traditional security controls would not necessarily adapt. Trying to lift and shift firewalls, threat and vulnerability management solutions and similar tools ran into a multitude of issues, including but not limited to networking, AWS IAM roles and permissions, and tool integrations. It was clear that tools built for on-premises deployments were no longer cost-effective or technically effective in AWS, and that a new solution was needed.

To remedy these discoveries, we decided to move to a multi-account strategy and automate our resource controls to support increasing consumption and account growth. Our answer to this was Capital One’s Cloud Custodian open source tool because it helps us manage our AWS environments by ensuring the following business needs are met:

  • Compliance with security policies
  • AWS tagging requirements
  • Identification of unused resources for removal/review
  • Off-hours enforcement to maximize cost reduction
  • Encryption enforcement
  • AWS Security Groups that are not overly permissive
  • And many more…

After identifying a tool that could automate our required controls in multiple accounts, it was time to implement it. The rest of this blog will focus on how Cloud Custodian works, how Code42 uses the tool, what kinds of policies (with examples) Code42 implemented and resources to help you get started implementing Cloud Custodian in your own environment.

How Code42 uses Cloud Custodian

Cloud Custodian is an open source tool created by Capital One. You can use it to automatically manage and monitor public cloud resources as defined by user-written policies. Cloud Custodian works in AWS, Google Cloud Platform and Azure. We, of course, use it in AWS.

As a flexible “rules engine,” Cloud Custodian allowed us to define rules and remediation efforts in one policy. Cloud Custodian uses policies to target cloud resources with specified actions on a scheduled cadence. These policies are written in a simple YAML configuration file that specifies a resource type, resource filters and the actions to take on the matching resources. Once a policy is written, Cloud Custodian can interpret the policy file and deploy it as a Lambda function into an AWS account. Each policy gets its own Lambda function that enforces the user-defined rules on a user-defined cadence. At the time of this writing, Cloud Custodian supports 109 resources, 524 unique actions and 376 unique filters.
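To make the resource/filters/actions structure concrete, here is a minimal sketch in Python that assembles a single policy and prints it as JSON (which is also valid YAML). The field names follow the general shape of Custodian's policy schema as we understand it; the policy name, tag keys and schedule are illustrative assumptions, not one of Code42's actual policies.

```python
import json


def build_policy(name, resource, filters, actions, schedule):
    """Assemble one policy: what to target, how to narrow it, what to do, how often."""
    return {
        "name": name,
        "resource": resource,
        # "periodic" mode asks Custodian to deploy the policy as a scheduled
        # Lambda function. A real deployment would also need an execution
        # role for that function; it is omitted from this sketch.
        "mode": {"type": "periodic", "schedule": schedule},
        "filters": filters,
        "actions": actions,
    }


policy_file = {
    "policies": [
        build_policy(
            name="ec2-missing-owner-tag",  # hypothetical policy name
            resource="ec2",
            filters=[{"tag:Owner": "absent"}],  # instances with no Owner tag
            actions=[{"type": "tag", "key": "violation", "value": "missing-owner"}],
            schedule="rate(1 day)",
        )
    ]
}

# Print the policy document; the JSON output can be read as YAML as well.
print(json.dumps(policy_file, indent=2))
```

Validate any policy written this way with Custodian's own tooling before deploying it; the sketch only shows how the three parts fit together.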

Instead of writing and combining multiple custom scripts that make AWS API calls, retrieve the responses and then execute further actions based on the results, we write an easy-to-read policy. Cloud Custodian interprets the policy's resources, filters and actions and translates them into the appropriate AWS API calls. These simplifications make this type of work achievable even for non-developers.

Now that we understand the basic concepts of Cloud Custodian, let’s cover the general implementation. Cloud Custodian policies are written and validated locally. These policies are then deployed either by running Cloud Custodian locally and authenticating to AWS or, in our case, via CI/CD pipelines. At Code42, we deploy a baseline set of policies to every AWS account as part of the bootstrapping process and then add or remove policies as needed for specific environments. In addition to account-specific policies, there are scenarios where a team may need an exemption, so we typically allow an “opt-out” tag for some policies. Policy violations are reported to a Slack channel, through a webhook created for each AWS account. We also send the resources.json logs directly to a SIEM for more robust handling and alerting.
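The opt-out idea can be sketched as a small Python function that splits a list of resources into exempt and in-scope groups. It assumes resources carry tags in the AWS-style `Tags` list of `Key`/`Value` pairs; the tag key `custodian-opt-out` and the sample instance IDs are hypothetical placeholders, not the tag Code42 actually uses.

```python
def has_tag(resource, key):
    """Return True if the resource carries a tag with the given key."""
    return any(tag.get("Key") == key for tag in resource.get("Tags", []))


def split_exempt(resources, opt_out_key):
    """Separate resources carrying the opt-out tag from those the policy should act on."""
    exempt = [r for r in resources if has_tag(r, opt_out_key)]
    in_scope = [r for r in resources if not has_tag(r, opt_out_key)]
    return exempt, in_scope


# Hypothetical sample data for illustration only.
resources = [
    {"InstanceId": "i-aaa", "Tags": [{"Key": "custodian-opt-out", "Value": "approved"}]},
    {"InstanceId": "i-bbb", "Tags": [{"Key": "Owner", "Value": "platform"}]},
    {"InstanceId": "i-ccc"},  # no tags at all
]

exempt, in_scope = split_exempt(resources, "custodian-opt-out")
print([r["InstanceId"] for r in exempt])
print([r["InstanceId"] for r in in_scope])
```

In a real policy the same effect is usually achieved with a tag filter inside the policy itself, so exempt resources never reach the actions.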

Broadly speaking, Code42 has categorized policies into two types: (i) notify only and (ii) action and notify. Notify policies are more hygiene-related and include policies like tag compliance checks, multi-factor authentication checks and more. Action and notify policies take actions after certain conditions are met, unless the resource is tagged for exemption. They include policies like s3-global-grants, ec2-off-hours-enforcement and more. The output from the Custodian policies is also ingested into a SIEM solution to provide more robust visualization and alerting. This allows the individual account owners to review policy violations and assign remediation actions to their teams. For Code42, these dashboards show both the security team and account owners the overall health of our security controls and account hygiene. Examples of Code42 policies may be found at GitHub.

What policies did we implement?

Code42 deployed three primary policy types: cost savings, hygiene and security. Since policies can take actions on resources, we learned that the team implementing the policies must collaborate closely with any teams affected by them. That way, all stakeholders know how to find and react to alerts and can provide feedback and request adjustments when necessary. Good collaboration with your stakeholders will ultimately drive the level of success you achieve with this tool. Let’s hit on a few specific policies.

Cost Savings Policy – ec2-off-hours-enforcement

EC2 is one of AWS’s most commonly used services. It allows a user to deploy cloud compute resources on demand. However, there are many cases where the compute gets left “on” even when it’s not being used, which racks up costs. With Cloud Custodian, we’ve allowed teams to define “off-hours” for their compute resources. For example, if I have a machine that only needs to be online 2 hours a day, I can automate the start and stop of that instance on a schedule. This saves 22 hours of compute time per day. As AWS usage grows, these savings add up quickly across instances and accounts.
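The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration of the schedule logic, not Custodian's implementation; the function names and the 9:00 to 11:00 window are assumptions chosen to match the 2-hours-a-day example.

```python
def should_be_running(hour, on_hour, off_hour):
    """Return True if an instance should be on at the given hour (0-23)."""
    if on_hour <= off_hour:
        # Same-day window, e.g. on at 9, off at 11.
        return on_hour <= hour < off_hour
    # Window that wraps past midnight, e.g. on at 22, off at 2.
    return hour >= on_hour or hour < off_hour


def running_hours(on_hour, off_hour):
    """Count the hours per day the instance is scheduled to be on."""
    return sum(should_be_running(h, on_hour, off_hour) for h in range(24))


def hours_saved_per_day(on_hour, off_hour):
    """Hours per day the instance is stopped instead of running around the clock."""
    return 24 - running_hours(on_hour, off_hour)


# A machine that only needs to be online from 9:00 to 11:00 each day.
print(running_hours(9, 11))
print(hours_saved_per_day(9, 11))
```

Multiplying the saved hours by an instance's hourly rate and by the number of instances on the same schedule gives a rough estimate of the savings.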

Hygiene Policy – ec2-tag-enforcement

AWS resource tagging is highly recommended in any environment. Tagging allows you to define multiple key-value pairs on resources that can be used for sorting, tracking, accountability and so on. At Code42, we require a pre-defined set of tags on every resource that supports tagging, in every account. Manually enforcing this would be nearly impossible, so we use a Custodian policy to enforce our tagging requirements across the board. This policy performs the series of actions described below.

  1. The policy applies filters to look for all EC2 resources missing the required tags.
  2. When a violation is found, the policy adds a new tag to the resource “marking” it as a violation.
  3. The policy notifies account owners of the violation and that the violating instance will be stopped and terminated after a set time if it is not fixed.

If Cloud Custodian finds that the required tags have been added within 24 hours, it removes the “violation” tag. If the proper tags are still missing after that, the policy continues to notify account owners that their instance will be terminated. If the violation is not fixed within the specified time period, the instance is terminated and a final notification is sent.
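The escalation flow above can be sketched as a small decision function. This models the logic only and is not how Custodian implements it; the function name, the returned step names, the required tag set and the 72-hour termination deadline in the example are all hypothetical, since the actual time period is configured per policy.

```python
def next_step(tags, required_tags, marked, hours_since_marked, terminate_after_hours):
    """Decide what the tag-enforcement policy should do with one instance.

    tags: dict of the instance's current tags
    marked: True if the instance already carries the "violation" tag
    """
    missing = set(required_tags) - set(tags)
    if not missing:
        # Tags were fixed: clear the marker if one was set earlier.
        return "remove-violation-tag" if marked else "compliant"
    if not marked:
        # First time the violation is seen: mark the instance and warn the owners.
        return "mark-and-notify"
    if hours_since_marked >= terminate_after_hours:
        # Deadline passed without a fix: terminate and send the final notice.
        return "terminate-and-notify"
    # Still inside the deadline: keep reminding the account owners.
    return "notify"


required = {"Owner", "Environment"}  # hypothetical required tag set

print(next_step({"Owner": "platform"}, required, False, 0, 72))
print(next_step({"Owner": "platform"}, required, True, 30, 72))
print(next_step({"Owner": "platform"}, required, True, 72, 72))
print(next_step({"Owner": "platform", "Environment": "dev"}, required, True, 10, 72))
```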

This policy ultimately ensures we have tags that distinguish things like a resource “owner.” An owner tag allows us to identify which team owns a resource and where the deployment code for that resource might exist. With this information, we can drastically reduce investigation/remediation times for misconfigurations or for troubleshooting live issues.

Security Policy – S3-delete-unencrypted-on-creation

At Code42, we require that all S3 buckets have either KMS or AES-256 encryption enabled. It is important to remember that we have an “opt-out” capability built into these policies so they can be bypassed when necessary and after approval. The bypass is done via a tag that is easy for us to search for and review to ensure bucket scope and drift are managed appropriately.

This policy is relatively straightforward. If the policy sees a “CreateBucket” CloudTrail event, it checks the bucket for encryption. If no encryption is enabled and an appropriate bypass tag is not found, the policy deletes the bucket immediately and notifies the account owners. By this point, you’ve likely heard of a data leak caused by a misconfigured S3 bucket. It can be nearly impossible to manually manage a large-scale S3 deployment or buckets created by shadow IT. This policy helps account owners learn good security hygiene, and at the same time it ensures our security controls are met automatically without having to search through accounts and buckets by hand. Ultimately, this helps ensure that S3 misconfigurations don’t lead to unexpected data leaks.
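The decision the policy makes for each event can be sketched as follows. This is a simplified model, not the policy itself: the function name, the returned outcome names and the bypass tag key are hypothetical, and the two algorithm strings are the values S3 uses for KMS and AES-256 server-side encryption.

```python
# Server-side encryption settings the policy accepts (KMS or AES-256).
ACCEPTED_ALGORITHMS = {"aws:kms", "AES256"}


def decide(event_name, sse_algorithm, tags, bypass_tag):
    """Return the outcome for one CloudTrail event about a bucket.

    sse_algorithm: the bucket's default encryption algorithm, or None if unset
    tags: dict of the bucket's tags
    """
    if event_name != "CreateBucket":
        return "ignore"  # the policy only reacts to bucket creation
    if sse_algorithm in ACCEPTED_ALGORITHMS:
        return "keep"
    if bypass_tag in tags:
        return "keep-exempt"  # approved opt-out, left in place for later review
    return "delete-and-notify"


BYPASS = "s3-encryption-opt-out"  # hypothetical bypass tag key

print(decide("CreateBucket", "aws:kms", {}, BYPASS))
print(decide("CreateBucket", None, {BYPASS: "approved"}, BYPASS))
print(decide("CreateBucket", None, {}, BYPASS))
print(decide("PutObject", None, {}, BYPASS))
```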

Just starting out?

Hopefully this blog helped highlight the power of Capital One’s Cloud Custodian and its automation capabilities. Cloud Custodian policies can be easily learned and written by non-developers, and they provide needed security capabilities. Check out the links in the “Resources” section below for Capital One’s documentation, as well as examples of some of Code42’s baseline policies that get deployed into every AWS account during our bootstrap process. Note: these policies should be tuned according to your business and environment needs, and not all will be applicable to you.

Resources:

Authors:

Jim Razmus II is director of cloud architecture at Code42. He tames complexity, seeks simplicity and designs elegantly. Connect with Jim Razmus II on LinkedIn.

Byron Enos is a senior security engineer at Code42, focused on cloud security and DevSecOps. Byron has spent the last four years helping develop secure solutions for multiple public and private clouds. Connect with Byron Enos on LinkedIn.

Aakif Shaikh, CISSP, CISA, CEH, CHFI is a senior security analyst at Code42. His responsibilities include cloud security, security consulting, penetration testing and insider threat management. Aakif brings 12+ years of experience in a wide variety of technical domains within information security, including information assurance, compliance and risk management. Connect with Aakif Shaikh on LinkedIn.
