CloudFormation Custom Resources With AWS Lambda

I've recently gotten my hands dirty with AWS CloudFormation Custom Resources. A stack I was working on had a very basic implementation of one, and when it was changed in production, it didn't behave the way I expected. In the middle of a production rollout, I had to wait an hour for CloudFormation to time out before fixing some code on the fly and retrying, which was very frustrating. The main use case for this stack was a B2B white-label solution where DNS could be driven from configuration files. This post walks through what I would do differently next time.

What is a CustomResource?

AWS CloudFormation is Amazon's declarative resource provisioning and management tool. You define your resources in JSON or YAML files, then interact with the CloudFormation APIs to declaratively enforce state. It has its pros and cons, and one con is that not every resource is supported by CloudFormation. For those cases you can use a custom resource: you point CloudFormation at a Lambda function, CloudFormation invokes it on create, update, and delete, and the function responds back to CloudFormation with the result of the action. This gives users a lot of flexibility but, as described below, it has pitfalls that you should avoid.
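To make that request/response contract a bit more concrete, here is a rough sketch of what a Create request looks like when it reaches the Lambda, and the body the function has to send back. The field names follow the custom resource request and response objects as I understand them; every value (URL, ARNs, IDs) and the `minimal_response` helper are made-up placeholders for illustration.

```python
import json

# Illustrative Create request; every value here is a placeholder.
sample_event = {
    "RequestType": "Create",  # "Update" and "Delete" arrive the same way
    "ResponseURL": "https://example.invalid/presigned-s3-url",
    "StackId": "arn:aws:cloudformation:eu-west-1:123456789012:stack/demo/abc",
    "RequestId": "11111111-2222-3333-4444-555555555555",
    "ResourceType": "Custom::DNSConfigCustom",
    "LogicalResourceId": "DNSConfigCustom",
    "ResourceProperties": {"Action": "getDNSConfig", "AppName": "demo"},
}


def minimal_response(event, status, physical_id, data=None, reason=""):
    """Build the JSON body that gets PUT to event['ResponseURL'] (hypothetical helper)."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        # These three are echoed back unchanged from the request.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        # Keys under Data are what !GetAtt reads in the template.
        "Data": data or {},
    }


body = minimal_response(
    sample_event,
    "SUCCESS",
    "dns-config-demo",
    data={"CertificateArn": "arn:aws:acm:placeholder"},
)
print(json.dumps(body, indent=2))
```

If that PUT never happens, CloudFormation has no way of knowing the function finished (or crashed), which is where the trouble in the rest of this post comes from.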

The Problem

  DNSConfigCustom:
    DeletionPolicy: Retain
    UpdateReplacePolicy: Retain
    Type: Custom::DNSConfigCustom
    Properties:
      ServiceToken: !Ref LambdaArn
      Action: getDNSConfig
      AppName: !Ref AppName
      Param1: !Ref Param1
      Param2: !Ref Param2

  CloudFrontDistribution:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Comment: !Sub '${AWS::StackName} ${Environment}'
        Enabled: true
        Aliases: !GetAtt DNSConfigCustom.CertificateArn
        ViewerCertificate:
          SslSupportMethod: sni-only
          MinimumProtocolVersion: TLSv1.2_2021
          AcmCertificateArn: !GetAtt DNSConfigCustom.CertificateArn
        Logging:
          Bucket: !Sub '${LogsS3Bucket}.s3.amazonaws.com'
          Prefix: cloudfront-web 
        ..... 

The above snippet seems simple enough, right? I'm creating a CloudFront distribution and using a custom resource backed by Lambda to get the DNS configuration and an ACM certificate ARN. Notice anything missing?

ServiceTimeout: 300

This one missing line in the custom resource's Properties has cost me hours of waiting. If the Lambda never responds, CloudFormation waits for the default timeout of 1 hour (3600 seconds).
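For clarity on where the line goes: it belongs under Properties, next to ServiceToken. Since CloudFormation also accepts JSON, here is a small sketch that builds the JSON form of the custom resource from the template above with the timeout added. I've written the value as a string because that is how I recall the documentation typing it; treat that detail, and the 1 to 3600 range in the comment, as my understanding rather than gospel.

```python
import json

# JSON form of the DNSConfigCustom resource, with ServiceTimeout added.
resource = {
    "DNSConfigCustom": {
        "Type": "Custom::DNSConfigCustom",
        "DeletionPolicy": "Retain",
        "UpdateReplacePolicy": "Retain",
        "Properties": {
            "ServiceToken": {"Ref": "LambdaArn"},
            # Seconds CloudFormation waits for a response before failing
            # the operation (assumed valid range: 1 to 3600).
            "ServiceTimeout": "300",
            "Action": "getDNSConfig",
            "AppName": {"Ref": "AppName"},
            "Param1": {"Ref": "Param1"},
            "Param2": {"Ref": "Param2"},
        },
    }
}

print(json.dumps(resource, indent=2))
```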

Unfortunately, the Lambda backing this custom resource wasn't coded defensively and has multiple points where it can fail without responding to CloudFormation. An in-house library of sorts had been created to handle custom resources in many different places, but it wasn't built consistently. There are many places where exceptions are thrown and not caught, which leaves CloudFormation stuck waiting for a response that's never going to arrive.

The Solution(s)

I did an analysis on the setup and refactored a few things:

  1. Added ServiceTimeout to the custom resources. Lambda functions have a maximum timeout of 15 minutes anyway (I'm not sure why CloudFormation landed on a 60-minute default given that). For my calls, I set the timeout to 5 minutes, as none of them did any heavy work from what I could see.
  2. Restructured the existing Lambda to handle exceptions and always respond to CloudFormation (I'll go through some examples below).
  3. Restructured the existing Lambda to handle all CloudFormation lifecycle events (it handled Create and Update but was missing Delete).
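One gap worth calling out: a try/except can't catch the Lambda itself running out of time, because the runtime simply stops the function. This is not something the original stack did, but a possible extra safeguard is a watchdog timer that fires shortly before the Lambda timeout and sends a FAILED response. Below is a minimal sketch of the idea. `start_watchdog`, the safety margin, and `FakeContext` are hypothetical names of mine; the only real API assumed is the Lambda context's `get_remaining_time_in_millis()`.

```python
import threading


def start_watchdog(context, on_timeout, safety_margin_s=5.0):
    """Schedule on_timeout() to run shortly before the Lambda would time out."""
    remaining_s = context.get_remaining_time_in_millis() / 1000.0
    delay_s = max(remaining_s - safety_margin_s, 0.0)
    timer = threading.Timer(delay_s, on_timeout)
    timer.daemon = True  # never keep the process alive just for the timer
    timer.start()
    return timer


class FakeContext:
    """Stand-in for the Lambda context so the sketch runs locally."""

    def get_remaining_time_in_millis(self):
        return 5100  # pretend 5.1 seconds are left


fired = threading.Event()
# In a real handler, on_timeout would send a FAILED response to CloudFormation.
timer = start_watchdog(FakeContext(), fired.set, safety_margin_s=5.0)
fired.wait(timeout=2.0)
print("watchdog fired:", fired.is_set())
```

In a real handler you would call `timer.cancel()` once the normal response has been sent, so CloudFormation doesn't get two answers for the same request.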

The Refactored Lambda Architecture

I broke the Lambda's logic out into a few functions and wrapped the overall call stack in a try/except. This ensures that if an exception is thrown, handling it still reports back to CloudFormation, so we don't have to fall back on the ServiceTimeout.

lambda_handler

As you can see below, only 3 lines in the handler aren't wrapped in a try/except, and they have a very low likelihood of throwing an exception.

The except clause always calls the send_response function, which ensures CloudFormation isn't left hanging.

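import json
import logging
from typing import Any, Dict, Optional

import urllib3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Assumed: `http` is a urllib3 PoolManager (it matches the request() call in send_response)
http = urllib3.PoolManager()


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]: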
    """
    Main Lambda handler for CloudFormation custom resource.

    Args:
        event: CloudFormation event object
        context: Lambda context object

    Returns:
        Response dictionary
    """
    logger.info(f"Received event: {json.dumps(event)}")

    request_type = event.get('RequestType')
    physical_resource_id = event.get('PhysicalResourceId')

    try:
        # Route to appropriate handler based on request type
        if request_type == 'Create':
            response_data = handle_create(event, context)
            # Generate physical resource ID for new resource
            organization_name = event.get('ResourceProperties', {}).get('OrganizationName', 'unknown')
            # The Lambda context exposes the request ID as aws_request_id
            physical_resource_id = f"dns-acm-info-{organization_name}-{context.aws_request_id[:8]}"

        elif request_type == 'Update':
            response_data = handle_update(event, context)
            # Keep existing physical resource ID
            if not physical_resource_id:
                physical_resource_id = context.log_stream_name

        elif request_type == 'Delete':
            response_data = handle_delete(event, context)
            # Keep existing physical resource ID
            if not physical_resource_id:
                physical_resource_id = context.log_stream_name

        else:
            raise ValueError(f"Unsupported request type: {request_type}")

        # Send success response
        send_response(
            event=event,
            context=context,
            response_status='SUCCESS',
            response_data=response_data,
            physical_resource_id=physical_resource_id
        )

        return {
            'statusCode': 200,
            'body': json.dumps(response_data)
        }

    except Exception as e:
        logger.error(f"Error processing request: {e}", exc_info=True)

        # Always send response to CloudFormation to prevent stack from hanging
        error_response_data = {
            'Error': str(e),
            'Message': 'Failed to process custom resource request'
        }

        send_response(
            event=event,
            context=context,
            response_status='FAILED',
            response_data=error_response_data,
            physical_resource_id=physical_resource_id or context.log_stream_name,
            reason=str(e)
        )

        # Return error response
        return {
            'statusCode': 500,
            'body': json.dumps(error_response_data)
        }
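The handle_create, handle_update and handle_delete functions aren't shown in this post. As a rough idea of their shape, here is a minimal sketch that assumes the resource is a read-only lookup (nothing is actually created in AWS). `lookup_dns_config`, the returned keys, and the placeholder values are all hypothetical stand-ins, not the real implementation.

```python
from typing import Any, Dict


def lookup_dns_config(props: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the real lookup (e.g. reading a config file)."""
    app_name = props.get("AppName", "unknown")
    return {
        "DomainName": f"{app_name}.example.invalid",
        "CertificateArn": "arn:aws:acm:placeholder",
    }


def handle_create(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    return lookup_dns_config(event.get("ResourceProperties", {}))


def handle_update(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Same lookup, run against the new properties.
    return lookup_dns_config(event.get("ResourceProperties", {}))


def handle_delete(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Nothing was created, so there is nothing to clean up. Returning
    # normally still matters: it lets the handler report SUCCESS so the
    # stack deletion doesn't hang.
    return {}


print(handle_create({"ResourceProperties": {"AppName": "demo"}}, None))
```

The point of the Delete handler is less about what it does and more that it exists: without it, a stack delete falls into the "unsupported request type" branch or, in the old code, never got a response at all.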

send_response

Having this one function handle the CloudFormation response keeps the response format consistent across every code path in the Lambda.

def send_response(event: Dict[str, Any], context: Any, response_status: str,
                  response_data: Dict[str, Any], physical_resource_id: Optional[str] = None,
                  reason: Optional[str] = None) -> None:
    """
    Send response to CloudFormation pre-signed URL.

    Args:
        event: CloudFormation event object
        context: Lambda context object
        response_status: SUCCESS or FAILED
        response_data: Data to return to CloudFormation
        physical_resource_id: Unique identifier for the custom resource
        reason: Reason for failure (if applicable)
    """
    response_url = event.get('ResponseURL')

    if not response_url:
        logger.error("No ResponseURL found in event")
        return

    # Use provided physical_resource_id or generate from context
    if physical_resource_id is None:
        physical_resource_id = context.log_stream_name

    response_body = {
        'Status': response_status,
        'Reason': reason or f'See CloudWatch Log Stream: {context.log_stream_name}',
        'PhysicalResourceId': physical_resource_id,
        'StackId': event.get('StackId'),
        'RequestId': event.get('RequestId'),
        'LogicalResourceId': event.get('LogicalResourceId'),
        'Data': response_data
    }

    json_response_body = json.dumps(response_body)

    logger.info(f"Response body: {json_response_body}")

    headers = {
        'Content-Type': '',
        'Content-Length': str(len(json_response_body))
    }

    try:
        response = http.request(
            'PUT',
            response_url,
            body=json_response_body.encode('utf-8'),
            headers=headers
        )
        logger.info(f"CloudFormation response status: {response.status}")
    except Exception as e:
        logger.error(f"Failed to send response to CloudFormation: {e}")
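Given how expensive a missing response is, it's worth checking the PUT locally before deploying. The sketch below is one way I might do that: it starts a throwaway HTTP server on localhost that stands in for the pre-signed ResponseURL and captures whatever body is sent to it. It uses only the standard library (urllib instead of the `http` client above) so it runs without dependencies; `CaptureHandler` and the payload are illustrative.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # bodies captured by the fake ResponseURL endpoint


class CaptureHandler(BaseHTTPRequestHandler):
    """Accept PUT requests and record their JSON bodies."""

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), CaptureHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
response_url = f"http://127.0.0.1:{server.server_port}/response"

# Simulate what the Lambda would send for a failed request.
payload = json.dumps({"Status": "FAILED", "Reason": "demo failure"}).encode("utf-8")
request = urllib.request.Request(response_url, data=payload, method="PUT")
with urllib.request.urlopen(request) as resp:
    status = resp.status

server.shutdown()
server.server_close()
print(status, received[0])
```

Pointing a test event's ResponseURL at a server like this makes it possible to assert that every branch of the handler, including the failure paths, actually sends something back.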