AWS Open Source Blog

Compliance as code and auto-remediation with Cloud Custodian

Many organizations identify governance and compliance as challenges, and cite a lack of visibility into cloud infrastructure as a prevalent problem. Companies spend thousands of hours a year maintaining compliance. Automating compliance monitoring and response not only reduces that maintenance burden but also increases visibility across cloud environments. As the cost and human effort of keeping up with compliance grow, validating and enforcing compliance nearly continuously, with auto-remediation, improves overall security posture and reduces compliance cost.

Implementing infrastructure as code with the AWS Cloud Development Kit (AWS CDK) makes it possible to realize policy as code across AWS resources with Open Policy Agent. That approach focuses on “preventive” controls that enforce business and governance requirements across AWS resources. In this post, we discuss how to enable “detective” and “responsive” controls to enforce nearly continuous compliance.

Cloud Custodian is an open source, stateless rules engine that manages AWS environments. It consolidates many of the compliance scripts organizations use into a lightweight and flexible tool. With Cloud Custodian, we can easily set rules that validate and enforce the environment against security and compliance standards.

AWS Lambda provides powerful, real-time, event-driven code execution in response to changes in AWS resources. Cloud Custodian can run policies against multiple kinds of event streams, including Amazon CloudWatch Events, AWS CloudTrail events, and more, and each Cloud Custodian policy can be deployed as an independent Lambda function.

With policies that run in AWS Lambda, Cloud Custodian enforces compliance as code and auto-remediation, enabling organizations to move fast and stay secure at the same time. Real-time visibility into who made which changes, and from where, lets us detect misconfigurations and non-compliance and respond quickly, before risks materialize.

The following steps demonstrate how to enable nearly continuous compliance with Cloud Custodian and AWS Lambda:

  • Set up AWS resources for testing.
  • Write Cloud Custodian policies.
  • Validate and enforce the policies.

Prerequisites

Familiarity with the AWS Management Console and with launching Amazon Elastic Compute Cloud (Amazon EC2) instances will make the following steps easier to follow.

Creating an EC2 environment in AWS Cloud9

We use the AWS Cloud9 environment for the rest of the post. Follow these instructions to create an Amazon Linux Cloud9 EC2 environment as the workspace.

Note: Cloud Custodian policy execution may alter AWS resources. For the purpose of this blog post and learning, do not try this in production. Use a test or sandbox account.

To begin, install Cloud Custodian:

$ python3 -m venv custodian
$ source custodian/bin/activate
(custodian) $ pip install c7n       #Install AWS package

Getting started

Set up AWS resources for testing.

Create an Amazon EC2 instance

Launch an Amazon EC2 instance (we can use a t2.micro, for example) and create the tag Custodian-Testing.

The tag value can be anything, and the tag can be added either when launching the instance or afterwards. One of the Cloud Custodian policies defined later, ec2-tag-compliance-mark, checks the instance for the presence of the Custodian-Testing tag.

Confirm that the new EC2 instance appears in the console and has the Custodian-Testing tag.

Screenshot of console displaying the instance created.

Create an AWS IAM policy

When a Cloud Custodian policy runs in a Lambda-based mode, Cloud Custodian automatically creates the necessary Lambda functions. Those functions need an execution role with permissions for the operations that the policies perform on AWS resources. Based on the Cloud Custodian policies in this post, create an IAM policy with the following permissions.

Note: Replace the placeholders ${Region}, ${account_id}, and ${ec2_id} with your Region, account ID, and the EC2 instance ID. The ec2:StopInstances permission is scoped to the previously created EC2 instance, which the ec2-tag-compliance-mark policy marks for stopping.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:StopInstances",
            "Resource": "arn:aws:ec2:${Region}:${account_id}:instance/${ec2_id}"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DescribeInstances",
                "ec2:UpdateSecurityGroupRuleDescriptionsEgress",
                "ec2:DescribeSecurityGroups",
                "ec2:UpdateSecurityGroupRuleDescriptionsIngress"
            ],
            "Resource": "*"
        }
    ]
}
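The ${...} placeholders in the preceding policy happen to use the same syntax as Python's string.Template, so one way to fill them in and confirm the result is still valid JSON is the following sketch. The function name render_policy and the sample Region, account ID, and instance ID are hypothetical values for illustration only.

```python
import json
from string import Template

# The IAM policy above, with its ${Region}, ${account_id}, and ${ec2_id}
# placeholders still in place.
POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:StopInstances",
            "Resource": "arn:aws:ec2:${Region}:${account_id}:instance/${ec2_id}"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DescribeInstances",
                "ec2:UpdateSecurityGroupRuleDescriptionsEgress",
                "ec2:DescribeSecurityGroups",
                "ec2:UpdateSecurityGroupRuleDescriptionsIngress"
            ],
            "Resource": "*"
        }
    ]
}"""


def render_policy(region: str, account_id: str, ec2_id: str) -> dict:
    """Fill in the placeholders and parse the result, failing loudly on bad JSON."""
    # substitute() raises KeyError if any placeholder is left without a value.
    text = Template(POLICY_TEMPLATE).substitute(
        Region=region, account_id=account_id, ec2_id=ec2_id
    )
    return json.loads(text)


# Hypothetical values; use your own Region, account ID, and instance ID.
policy = render_policy("us-east-1", "123456789012", "i-0123456789abcdef0")
print(json.dumps(policy, indent=4))
```

Using substitute() rather than safe_substitute() is deliberate: a forgotten placeholder raises an error instead of silently producing an ARN that matches nothing.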

Create an AWS IAM role

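Create an IAM role that AWS Lambda can assume, and attach the preceding policy along with the AWS managed policies AWSLambdaBasicExecutionRole and AWSConfigRulesExecutionRole, which cover Lambda execution (including writing to CloudWatch Logs) and AWS Config rule execution.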
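For Lambda to use the role, the role's trust policy must allow the lambda.amazonaws.com service principal to call sts:AssumeRole. The following sketch builds that trust policy document and lists the ARNs of the two managed policies; the function name lambda_trust_policy is hypothetical, and the document is printed rather than sent to IAM.

```python
import json

# AWS managed policies attached alongside the custom policy above.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/service-role/AWSConfigRulesExecutionRole",
]


def lambda_trust_policy() -> dict:
    """Trust policy that allows the AWS Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


print(json.dumps(lambda_trust_policy(), indent=4))
for arn in MANAGED_POLICY_ARNS:
    print(arn)
```

Saved to a file (trust-policy.json is a hypothetical name), the document can be passed to `aws iam create-role --assume-role-policy-document file://trust-policy.json`, and each managed policy ARN to `aws iam attach-role-policy --policy-arn`.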

Screenshot of the policy summary.

Write Cloud Custodian policies

Cloud Custodian policies are written in YAML, a human-readable format, which makes them straightforward to write.

The policies usually include the following:

  • The type of resource to run the policy against
  • Filters to narrow down the set of resources
  • Actions to take on the filtered set of resources
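How those three parts fit together can be illustrated with a toy model in Python: select resources of one type, keep only those that pass every filter, then apply each action to what is left. This is a hypothetical sketch for intuition only; ToyPolicy, tag_present, and stop are invented names, not Cloud Custodian classes or APIs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Toy resource record. Tags are a plain dict here for brevity; EC2 itself
# returns tags as a list of {"Key": ..., "Value": ...} pairs.
Resource = Dict[str, Any]


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a Custodian policy, not Custodian's own classes."""
    name: str
    resource: str  # resource type, e.g. "ec2"
    filters: List[Callable[[Resource], bool]]
    actions: List[Callable[[Resource], None]] = field(default_factory=list)

    def run(self, resources: List[Resource]) -> List[Resource]:
        # A resource is selected only if every filter accepts it.
        matched = [r for r in resources if all(f(r) for f in self.filters)]
        # Every action is applied to every selected resource.
        for action in self.actions:
            for r in matched:
                action(r)
        return matched


def tag_present(key: str) -> Callable[[Resource], bool]:
    return lambda r: key in r.get("Tags", {})


def stop(r: Resource) -> None:
    r["State"] = "stopped"


instances = [
    {"InstanceId": "i-1", "State": "running", "Tags": {"Custodian-Testing": "yes"}},
    {"InstanceId": "i-2", "State": "running", "Tags": {}},
]
policy = ToyPolicy("stop-tagged", "ec2", [tag_present("Custodian-Testing")], [stop])
print([r["InstanceId"] for r in policy.run(instances)])  # ['i-1']
```

As in Cloud Custodian's own policy files, listing several filters in this toy means all of them must match; an explicit `or` block is needed to relax that.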

To find out more, see the Cloud Custodian documentation.

To begin, open a terminal in the AWS Cloud9 IDE and use it for the following steps.

Set an environment variable to the ARN of the role created earlier. The ARN has the following format:

arn:aws:iam::xxx:role/xxx

Remember to replace ${Custodian_Lambda_Role_Arn_Value} with the value for your environment:

export Custodian_Lambda_Role_Arn=${Custodian_Lambda_Role_Arn_Value}
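If the variable is unset or still holds a placeholder, the policy file generated next will contain an empty or invalid role. A quick check, sketched here in Python under the assumption that role ARNs follow the arn:aws:iam::<12-digit account>:role/<name> pattern (optionally with a path and a non-default partition), might look like this; is_role_arn is a hypothetical helper name.

```python
import os
import re

# Accepts role ARNs in the aws, aws-cn, and aws-us-gov partitions, including
# role names with a path, e.g. arn:aws:iam::123456789012:role/service-role/name.
ROLE_ARN = re.compile(r"arn:aws(?:-cn|-us-gov)?:iam::\d{12}:role/[\w+=,.@/-]+")


def is_role_arn(value: str) -> bool:
    """Return True if `value` looks like an IAM role ARN."""
    return ROLE_ARN.fullmatch(value) is not None


arn = os.environ.get("Custodian_Lambda_Role_Arn", "")
if not is_role_arn(arn):
    print(f"Custodian_Lambda_Role_Arn does not look like a role ARN: {arn!r}")
```

The pattern is deliberately loose about the role name itself; it only guards against an empty value, the literal xxx placeholders, or an ARN for a different IAM entity type.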

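Generate the policy file. Because the heredoc delimiter (EOF) is unquoted, the shell substitutes the value of Custodian_Lambda_Role_Arn into each policy's role field: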

cat > custodian_polices.yml <<EOF
policies:
- name: ec2-invalid-ami
  resource: ec2
  description: |
    Find all running EC2 instances that are using invalid AMIs and stop them
  mode:
    type: periodic
    schedule: rate(1 day)
    role: ${Custodian_Lambda_Role_Arn}
  filters:
    - "State.Name": running
    - type: value
      key: ImageId
      op: in
      value:
          - ami-02bcbb802e03574ba
#   actions:
#     - stop

- name: sg-add-permission
  resource: security-group
  description: |
    Filter any security group that
    allows 0.0.0.0/0 or ::/0 (IPv6) ingress on port 22, remove
    the rule
  mode:
    type: cloudtrail
    role: ${Custodian_Lambda_Role_Arn}
    events:
      - source: ec2.amazonaws.com
        event: AuthorizeSecurityGroupIngress
        ids: "requestParameters.groupId"
      - source: ec2.amazonaws.com
        event: RevokeSecurityGroupIngress
        ids: "requestParameters.groupId"
  filters:
    - or:
      - type: ingress
        Ports: [22]
        Cidr: "0.0.0.0/0"
      - type: ingress
        Ports: [22]
        CidrV6: "::/0"
#   actions:
#     - type: set-permissions
#       # remove the permission matched by a previous ingress filter.
#       remove-ingress: matched

- name: ec2-tag-compliance-mark
  resource: ec2
  description: | 
    Find all non-compliant tag instances to stop in 1 day.
  mode:
    type: config-rule
    role: ${Custodian_Lambda_Role_Arn}
  filters:
    - "tag:Custodian-Testing": present
    # No space after "tag:"; otherwise the filter looks for a tag named " maid_status"
    - "tag:maid_status": absent
  actions:
    - type: mark-for-op
      op: stop
      days: 1
EOF

The three policies in this example fulfill the following tasks:

  • Find all running EC2 instances that are using invalid AMIs (here, a hard-coded list of AMI IDs) and stop them. Running the policy creates a Lambda function that CloudWatch Events invokes on a schedule, once a day. The stop action is commented out in the preceding file.
  • Filter any security group that allows ingress from 0.0.0.0/0 or ::/0 on port 22 and remove the rule. Running the policy creates a Lambda function that is invoked when CloudTrail records an AuthorizeSecurityGroupIngress or RevokeSecurityGroupIngress API call. The removal action is also commented out.
  • Find all EC2 instances that have the Custodian-Testing tag but no maid_status tag, and mark them to be stopped in one day. This creates a Lambda function and an AWS Config rule that invokes it.
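To make the second policy's matching logic concrete, here is a rough Python approximation of what its ingress filters check, run against security group data shaped like the IpPermissions field returned by the EC2 DescribeSecurityGroups API. It is a simplified sketch, not Cloud Custodian's implementation, and the function names are hypothetical.

```python
from typing import Any, Dict


def opens_port_to_world(perm: Dict[str, Any], port: int = 22) -> bool:
    """True if one IpPermissions entry allows `port` from 0.0.0.0/0 or ::/0."""
    if perm.get("IpProtocol") == "-1":
        port_ok = True  # "-1" means all protocols and all ports
    else:
        lo, hi = perm.get("FromPort"), perm.get("ToPort")
        port_ok = lo is not None and hi is not None and lo <= port <= hi
    if not port_ok:
        return False
    v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
    v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
    return v4 or v6  # mirrors the policy's `or` of the Cidr and CidrV6 filters


def matches_policy(group: Dict[str, Any]) -> bool:
    """True if any ingress rule in the security group opens port 22 to the world."""
    return any(opens_port_to_world(p) for p in group.get("IpPermissions", []))


open_ssh = {
    "GroupId": "sg-example",  # hypothetical ID
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
print(matches_policy(open_ssh))  # True
```

Note that a rule covering a port range, such as 0 to 1024, also exposes port 22, which is why the sketch checks whether the port falls inside FromPort..ToPort rather than testing for equality.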

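Make sure you understand the effects before uncommenting the action sections of the preceding policies. Note that the set-permissions action that removes ingress rules also requires the ec2:RevokeSecurityGroupIngress permission, which the IAM policy created earlier does not grant.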

Validate and enforce the Cloud Custodian policy

Before running the policies, validate them against Cloud Custodian's JSON schema:

$ custodian validate custodian_polices.yml

Screenshot of output once you run the command to validate the policies.

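Dry-run Cloud Custodian policy

Performing a dry run before running policies against infrastructure is usually preferred. In dry-run mode, Cloud Custodian queries and filters resources from the terminal without deploying Lambda functions or executing any actions.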

$ custodian run --dryrun custodian_polices.yml -s out

Screenshot of output once you run the command to dry-run Cloud Custodian Policy.

The preceding command created several files under the out directory, which is specified with -s (short for --output-dir). For each policy, Cloud Custodian records metrics for Resource Count, Resource Time, and Action Time.

$ less out/ec2-tag-compliance-mark/metadata.json
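Instead of paging through the raw JSON, the metrics can be pulled out with a few lines of Python. This sketch assumes metadata.json has a top-level "metrics" list whose entries carry "MetricName" and "Value" keys (for example, ResourceCount); that layout is an assumption, so check it against your own output, because it may differ between Cloud Custodian versions. The function name metrics_from_metadata is hypothetical.

```python
import json
import os
from typing import Any, Dict


def metrics_from_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Map metric names to values, assuming a top-level "metrics" list of
    {"MetricName": ..., "Value": ...} entries (an assumption; verify your file)."""
    return {m["MetricName"]: m["Value"] for m in metadata.get("metrics", [])}


# Path written by the dry run above (-s out).
path = "out/ec2-tag-compliance-mark/metadata.json"
if os.path.exists(path):
    with open(path) as f:
        for name, value in metrics_from_metadata(json.load(f)).items():
            print(f"{name}: {value}")
```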

The following image shows the metrics:

Screenshot of the output displaying the metrics.

Next, we use the report subcommand to summarize the results of the ec2-tag-compliance-mark policy:

$ custodian report custodian_polices.yml -s out --format grid -p ec2-tag-compliance-mark

```
(custodian) zxuejiao:~/environment/custodian/demo $ custodian report  custodian_polices.yml -s out --format grid -p ec2-tag-compliance-mark
+----------------------------+---------------------+-----------------+----------------+---------------------------+--------------+--------------------+
| CustodianDate              | InstanceId          | tag:Name        | InstanceType   | LaunchTime                | VpcId        | PrivateIpAddress   |
+============================+=====================+=================+================+===========================+==============+====================+
| 2020-10-14 07:26:21.322124 | i-XXX               | Cloud Custodian | t2.micro       | 2020-10-14T01:07:07+00:00 | vpc-XXX      | XXX.XXX.XXX.XXX    |
+----------------------------+---------------------+-----------------+----------------+---------------------------+--------------+--------------------+
```

Run Cloud Custodian policy

Everything looks as expected, so now we are going to run the policies:

$ custodian run custodian_polices.yml -s out

Output displaying the policies created.

Running the policies creates the Lambda functions and the AWS Config rule if they don't already exist; otherwise, Cloud Custodian updates them.

Lambda functions:

Screenshot of console displaying the Lambda functions.

AWS Config rule:

Screenshot of console showing the AWS Config rule.

The AWS Config rule dashboard shows one non-compliant resource.

Screenshot of the console on the screen showing the non-compliant resource.

Check the EC2 instance created earlier. The ec2-tag-compliance-mark policy has added a tag named maid_status, which marks the instance to be stopped one day later.
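Marking does not stop the instance by itself. In Cloud Custodian, a follow-up policy that uses the marked-for-op filter (with op: stop) together with a stop action would act on the instance once the date arrives. The date arithmetic behind the mark can be sketched as follows; the helper names are hypothetical, the %Y/%m/%d format is taken from the policy's log output in this post, and the sketch ignores time zones, which Cloud Custodian handles itself.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y/%m/%d"  # matches the date in the policy's log output


def action_date(marked_at: datetime, days: int) -> str:
    """Date on which the marked operation becomes due."""
    return (marked_at + timedelta(days=days)).strftime(DATE_FORMAT)


def is_due(due_date: str, now: datetime) -> bool:
    """True once `now` has reached the due date (date-only comparison)."""
    return now.date() >= datetime.strptime(due_date, DATE_FORMAT).date()


# days: 1, as in the ec2-tag-compliance-mark policy
due = action_date(datetime(2020, 10, 14, 7, 26, 21), days=1)
print(due)  # 2020/10/15
```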

Screenshot of the console displaying the tags within the EC2 instance created earlier.

Additionally, each Lambda function that Cloud Custodian deploys writes its logs to a CloudWatch log group named after the policy. We can refer to these logs when troubleshooting.
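By default, Cloud Custodian names each Lambda function custodian-<policy name> (the prefix can be changed with the function-prefix setting in a policy's mode block), and Lambda writes to a log group named /aws/lambda/<function name>. Assuming those defaults, this sketch derives the log group names for the three policies in this post; the helper names are hypothetical.

```python
DEFAULT_PREFIX = "custodian-"  # Cloud Custodian's default function-prefix


def function_name(policy_name: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Name of the Lambda function deployed for a policy."""
    return prefix + policy_name


def log_group_name(policy_name: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Lambda's default log group for that function."""
    return "/aws/lambda/" + function_name(policy_name, prefix)


for policy in ("ec2-invalid-ami", "sg-add-permission", "ec2-tag-compliance-mark"):
    print(log_group_name(policy))
```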

Screenshot of the CloudWatch log groups.

The following logs show the results of the ec2-tag-compliance-mark policy execution:

```
invoking action:tagdelayedaction
Tagging 1 resource for stop on 2020/10/15
```

Output of the 'ec2-tag-compliance-mark' policy command.

These logs show that the Lambda function evaluated the policy and carried out its remediation action: tagging the non-compliant instance to be stopped.

Clean up

Delete the AWS Cloud9 environment from the console. Then remove the EC2 instance and the IAM role and policy we created, as well as the Lambda functions, CloudWatch Events rules, and AWS Config rule that running the Cloud Custodian policies created.

Wrap up

This example demonstrates how to implement compliance as code with Cloud Custodian and how to use AWS Lambda for auto-remediation. Cloud Custodian lets us define rules and remediation actions together in one policy, executed with AWS Lambda, to keep cloud infrastructure well managed. Every organization has a set of policies to follow for detecting violations and taking remediation actions on its AWS resources.

By using Cloud Custodian and AWS Lambda to enforce compliance as code and auto-remediation, we are able to:

  • Construct policies, from simple queries to complex workflows, using an easy-to-read DSL that performs remediation automatically.
  • Get governance as a core capability through a YAML DSL rules engine that integrates with serverless services such as AWS Lambda for real-time response.
  • Achieve nearly continuous compliance by actively enforcing the policies and conforming to internal best practices and guidelines.

DevOps processes can incorporate automated security testing and compliance, bringing us much closer to DevSecOps. Cloud Custodian addresses challenges such as security enforcement, tagging, cleanup of unused or invalid resources, account maintenance, cost control, and backups.

Let your imagination run wild and use these tools to get more visibility and control over your entire AWS environment.

Xuejiao Zhang

Xuejiao Zhang is a DevOps Consultant at Amazon Web Services. Prior to joining Amazon, Xuejiao worked in the cloud computing field at several e-commerce companies (JD.COM, Rakuten Japan). Xuejiao focuses on the design, implementation, and consultation of DevSecOps solutions for containers, microservices, CI/CD, service mesh, PaaS, and more. You can find her on GitHub.