How to pass AWS Developer Associate Exam – Part 2



Continuous Integration is the practice of developers frequently pushing code to a repository, where a build server automatically builds and tests it and gives the developer immediate feedback. This allows them to find and fix bugs early, deliver faster because the code is already tested, and deploy often.

Continuous Delivery ensures that software can be released reliably whenever needed, and that deployments happen often and are quick. This is done using automated deployments.

Technology stack for CI/CD


CodeCommit is AWS’s version control service, based on Git.

  • Private Git repositories
  • No size limit on repositories (scales)
  • Fully managed and highly available.
  • Code only in AWS Cloud Account
    • Increased security and compliance
    • Code never leaves AWS
  • Secure
    • Encrypted
    • Access Control
  • Integrates with tools such as Jenkins, CodeBuild etc.

CodeCommit Security

  • Interactions are done using Git
  • Authentication is done using Git
    • SSH Keys (configure SSH keys in the IAM console)
    • HTTPS (configure through the AWS CLI credential helper or by generating your own HTTPS Git credentials)
    • MFA can also be used
  • Authorization in Git
    • IAM Policies to manage user and role rights
  • Encryption
    • Repositories are automatically encrypted at rest using KMS
    • Encrypted in transit (HTTPS, SSH)
  • Cross account access
    • Use IAM role in your AWS account and use AWS STS (with AssumeRole API)

CodeCommit Differences with GitHub

  • Security
    • GitHub has GitHub Users
    • CodeCommit has AWS IAM users and roles
  • Hosted
    • GitHub hosted by GitHub (managed)
    • GitHub Enterprise – Self hosted and managed on your own servers
    • CodeCommit – managed and hosted by AWS

CodeCommit Notifications

  • Can trigger notifications in CodeCommit using:
    • AWS SNS
    • AWS Lambda
    • AWS CloudWatch Event Rules
  • Use cases for SNS / AWS Lambda functions:
    • Deletion of branches
    • Trigger for pushes happening in master branch
    • Notify external build system
    • Trigger AWS Lambda function to perform codebase analysis e.g. check if someone accidentally committed credentials in the code.
  • Use cases for CloudWatch Event Rules:
    • Trigger for pull request updates (created, deleted, updated etc.)
    • Commit comment events
    • CloudWatch event rules send their events to an SNS topic.


CodePipeline allows for Continuous Delivery

Sources – GitHub, CodeCommit, ECR, S3

Build – CodeBuild, Jenkins

Load Testing – 3rd party tools

Deploy – AWS CodeDeploy, Beanstalk, CloudFormation, ECS

CodePipeline Artifacts

Each pipeline stage will create an artifact. Stages are comprised of Action Groups (can be multiple action groups).

These artifacts are files that are created and stored in S3, and then passed on to the next stage.

CodePipeline Troubleshooting

  • CodePipeline state changes trigger CloudWatch Events, which can in turn create SNS notifications.
    • E.g. you can create an event for failed pipelines.
  • If CodePipeline fails a stage, pipelines stop and you can get information about the failure in the console
  • AWS CloudTrail can be used to audit AWS API calls
  • If pipeline can’t perform an action make sure the IAM Service Role attached has the correct permissions (IAM Policy)


CodeBuild is used for building and testing an application – an alternative to Jenkins.

Fully managed build service meaning that it can scale. No servers to manage or provision and no build queue.

You only pay for usage compared with Jenkins which would be constantly running on an EC2 instance incurring charges.

CodeBuild uses Docker for reproducible builds (AWS managed Docker images) and its abilities can be extended by using your own Docker images.

CodeBuild containers are deleted at the end of their execution (success or failure). They cannot be SSH’d into, even while they are running. Therefore they are not a solution for debugging failing pipelines.

  • CodeBuild is secure
    • Integrated with KMS for encryption of build artifacts.
    • IAM for build permissions
    • VPC for network security
    • CloudTrail for API calls logging and auditing.
  • CodeBuild features
    • The source code can come from GitHub, CodeCommit, CodePipeline, S3.
    • Build instructions can be defined in code, in a buildspec.yml file.
    • Output logs to Amazon S3 and AWS CloudWatch Logs.
    • Metrics can be used to monitor CodeBuild statistics.
    • Use CloudWatch Alarms to detect failed build and trigger notifications.
    • CloudWatch events / AWS Lambda as a Glue.
    • SNS notifications
    • Ability to reproduce CodeBuild locally to troubleshoot in case of errors.
    • Builds can be defined with CodePipeline or CodeBuild itself.

How CodeBuild Works

CodeBuild buildspec.yml

  • buildspec.yml must be at the root of your code.
  • Define environment variables
    • Plaintext variables
    • Secure secrets: use SSM Parameter Store
  • Phases (commands to run)
    • Install phase
    • Pre-build – final commands to execute before build
    • Build – build commands
    • Post build – e.g. zip the files
  • Artifacts are generated from the build
    • Uploaded to S3
    • Encrypted with KMS
  • Cache
    • Files to cache (usually dependencies) to S3
    • Speed up future builds

Then you can add a stage in your pipeline. The action will be CodeBuild.
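
The buildspec structure described above can be sketched as follows. This is only an illustration: the variable names, the SSM parameter path, the commands and the artifact name are all assumptions, and the small helper simply lists the phases in the order they are declared.

```python
# A minimal buildspec.yml kept as a string; every value is illustrative.
BUILDSPEC = """\
version: 0.2

env:
  variables:
    APP_ENV: "dev"                    # plaintext variable
  parameter-store:
    DB_PASSWORD: /myapp/db/password   # hypothetical SSM parameter name

phases:
  install:
    commands:
      - echo "install dependencies"
  pre_build:
    commands:
      - echo "final steps before the build"
  build:
    commands:
      - echo "build and test"
  post_build:
    commands:
      - echo "package the output"

artifacts:
  files:
    - app.zip

cache:
  paths:
    - "node_modules/**/*"
"""


def phase_names(buildspec_text):
    """Return the phase names in the order they appear under 'phases:'."""
    names, in_phases = [], False
    for line in buildspec_text.splitlines():
        if line.startswith("phases:"):
            in_phases = True
        elif in_phases and line and not line.startswith(" "):
            in_phases = False  # reached the next top-level key
        elif in_phases and line.startswith("  ") and not line.startswith("   "):
            names.append(line.strip().rstrip(":"))
    return names


print(phase_names(BUILDSPEC))
```

In a real project the string would live in a buildspec.yml file at the root of the repository; the helper is only there to make the phase order visible.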

CodeBuild Local build

CodeBuild local helps you to troubleshoot beyond logs by running CodeBuild locally on your machine (requires Docker).

It uses the CodeBuild Agent.

CodeBuild in VPC

By default CodeBuild containers are launched outside your VPC. Therefore they cannot access resources in the VPC.

E.g. if you had RDS in a private subnet, CodeBuild would not be able to access it.

  • But a VPC configuration can be specified
    • VPC ID
    • Subnet IDs
    • Security Group IDs

Then CodeBuild is able to access resources in your VPC (RDS, ElastiCache etc.)

  • The use cases are
    • Running integration tests
    • Data queries
    • Internal load balancers


CodeDeploy is a tool used when you want to automatically deploy an application to many EC2 instances.

It is aimed at EC2 instances that are not managed by Elastic Beanstalk, although on-premise machines and Lambda are also supported (see below).

Traditionally this would be done using tools such as Ansible, Terraform, Chef, Puppet.

Instead you can use CodeDeploy which is a managed service.

  • How to make CodeDeploy work
    • Each EC2 instance (or on premise machine) must be running the CodeDeploy agent.
    • The agent is going to continually poll for CodeDeploy work to do.
    • CodeDeploy sends an appspec.yml file.
    • Application is then pulled from GitHub or S3.
    • EC2 will run the deployment instructions.
    • The CodeDeploy agent will report the success or failure of the deployment on the instance.
  • CodeDeploy things to note
    • EC2 instances are grouped by deployment group (dev, test, prod)
    • Lots of flexibility to define any kind of deployments
    • CodeDeploy can be chained into a CodePipeline and use artifacts from there
    • Re-use existing setup tools, works with any application and auto scaling integration
    • Can perform Blue/Green deployments (only works with EC2 instances and not on premise)
    • Support AWS Lambda deployments.
    • CodeDeploy does not provision instances
  • CodeDeploy primary components
    • Application – unique name
    • Compute Platform – EC2/On premise or Lambda
    • Deployment Configuration – Deployment rules for success / failures
      • For EC2 instances you can specify the minimum number of healthy instances for the deployment.
      • For AWS Lambda you can specify how traffic is routed to your updated Lambda function versions.
    • Deployment Group – Set of EC2 instances that you are going to deploy your application to. Group of tagged instances (allows for gradual deployment). Instances with the correct tag will be deployed to e.g. ENV=DEV.
    • Deployment Type – In-place deployment or Blue/Green deployment
    • IAM Instance Profile – EC2 needs permissions to pull from S3 / GitHub
    • Application revision – application code and appspec.yml file
    • Service role – role for CodeDeploy to perform what it needs
    • Target revision – Target deployment application version

CodeDeploy appspec.yml

  • File Section – how to source and copy files from S3 / GitHub to the filesystem
  • Hooks – set of instructions to do to deploy the new version (hooks can have timeouts). Hook order is:
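    1. ApplicationStop
    2. DownloadBundle (run by the CodeDeploy agent, cannot be scripted)
    3. BeforeInstall
    4. Install (run by the CodeDeploy agent, cannot be scripted)
    5. AfterInstall
    6. ApplicationStart
    7. ValidateService (like a health check)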
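
As a rough sketch of the hook order, the snippet below sorts a set of hook scripts into the order an in-place EC2 deployment would run them. It covers only the lifecycle events listed here (deployments behind a load balancer have additional traffic hooks), and the script paths are hypothetical.

```python
# Lifecycle events for an in-place EC2 deployment, in execution order.
# DownloadBundle and Install are run by the CodeDeploy agent itself.
HOOK_ORDER = [
    "ApplicationStop", "DownloadBundle", "BeforeInstall", "Install",
    "AfterInstall", "ApplicationStart", "ValidateService",
]
AGENT_ONLY = {"DownloadBundle", "Install"}


def plan_hooks(scripts):
    """Order hook scripts the way the deployment would run them.

    `scripts` maps a hook name to a script path (paths are hypothetical).
    """
    unknown = set(scripts) - set(HOOK_ORDER)
    reserved = set(scripts) & AGENT_ONLY
    if unknown or reserved:
        raise ValueError(f"cannot script: {sorted(unknown | reserved)}")
    return [(hook, scripts[hook]) for hook in HOOK_ORDER if hook in scripts]


plan = plan_hooks({
    "ValidateService": "scripts/health_check.sh",
    "ApplicationStop": "scripts/stop.sh",
    "AfterInstall": "scripts/configure.sh",
})
print(plan)
```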

CodeDeploy Deployment Config

  • Deployment configs:
    • One at a time – one instance at a time, if one instance fails then the whole deployment stops
    • Half at a time (50%)
    • All at once – quick but there is downtime (good for dev)
    • Custom – e.g. min healthy host 75%
  • Failures:
    • Instances stay in failed state
    • New deployments will first be deployed to failed state instances
    • To rollback – redeploy old deployment or enable automated rollback for failures
  • Deployment Targets:
    • Set of EC2 instances with tags
    • Directly to an ASG
    • Mix of ASG or Tags to create deployment segments
    • Customization in scripts with DEPLOYMENT_GROUP_NAME environment variables.
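
To make the deployment configs concrete, here is a small sketch of how a minimum healthy percentage limits how many instances can be updated at once. It assumes the minimum healthy count is rounded up to a whole instance, which is my understanding of CodeDeploy's behaviour; treat it as an approximation rather than the exact algorithm.

```python
import math


def max_offline(total, min_healthy_percent):
    """Instances that may be deployed to at once while keeping the
    requested percentage healthy (the minimum is rounded up)."""
    min_healthy = math.ceil(total * min_healthy_percent / 100)
    return total - min_healthy


for name, percent in [("All at once", 0), ("Half at a time", 50), ("Custom 75%", 75)]:
    print(name, max_offline(10, percent))
```

One at a time is expressed as a host count rather than a percentage: all instances but one must stay healthy.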

CodeDeploy In Place Deployment

Half at a time.

Only works with EC2 and On Premise compute platforms.

CodeDeploy Blue Green Deployment

Load balancer will start routing traffic to a new group, if they all pass then it will shift all the traffic to the new instances.

Works with EC2 (but not on premise), Lambda and ECS. ECS blue/green deployments can also be managed through CloudFormation.

CodeDeploy for EC2 and ASG

  • Deployment to EC2
    • Define how to deploy the application using appspec.yml + deployment strategy
    • Will perform an in-place update to the EC2 instances
    • Can use hooks to verify the deployment after each deployment phase
  • Deployment to ASG
    • In-place
      • Updates current existing EC2 instances using in-place update.
      • Instance newly created by an ASG will also get automated deployments
    • Blue / Green deployment
      • A new auto-scaling group is created (settings are copied)
      • Choose how long to keep the old instances
      • Must be using an ELB
CodeDeploy Blue Green deployment. ASG will launch new instances with the updated application, then terminate the V1 instances once the deployment has successfully completed

CodeDeploy Rollbacks

You can specify automated rollback options.

  • Rollback options
    • Roll back when a deployment fails
    • Roll back when alarm thresholds are met
    • Disable rollbacks – do not perform any

If a rollback needs to take place, CodeDeploy redeploys the last known good revision as a new deployment on the failed instances first.

The rollback is a new deployment with its own deployment ID, not a restored version of the old one.


CodeStar is an integrated solution (wrapper) that groups all of the individual CI/CD solutions.

  • Groups together
    • GitHub
    • CodeCommit
    • CodeBuild
    • CodeDeploy
    • CloudFormation
    • CodePipeline
    • CloudWatch

Helps to quickly create CI/CD ready projects for EC2, Lambda and Beanstalk.

Has issue tracking integration with Jira and GitHub

Integrates with Cloud9 to obtain a web IDE (not available in all regions)

Comes with one dashboard to view all your components.

Free service – only pay for underlying resources used.

But has limited customization


CloudFormation is infrastructure as code.

It allows you to declare your AWS infrastructure and resources as code.

  • Benefits
    • No resources manually created
    • It can be versioned in Git
    • Changes to infrastructure can be reviewed through code
  • Cost
    • Resources can be tagged to identify how much the stack is costing.
    • Costs can be estimated using the CloudFormation template
    • Savings strategy – delete all resources at 5pm and safely recreate them all at 8am automatically
  • Productivity
    • Ability to destroy and re-create infrastructure on the cloud quickly
    • Automated generation of diagrams for your templates
    • Declarative programming (no need to figure out order and orchestration)
  • Separation of concern
    • Create many stacks for many apps and many layers e.g.
    • VPC stacks
    • Network stacks
    • App stacks

There are a lot of existing CloudFormation stacks available to get started quickly.

How to use CloudFormation

Templates are uploaded to S3 and then referenced in CloudFormation

Templates can’t be edited once uploaded. You can only upload new templates.

Deleting a stack deletes every single artifact that was created in CloudFormation

  • Creating CloudFormation Templates
    • Manually using the CloudFormation Designer and the console input parameters
    • Automated using YAML file templates and using the CLI to deploy the templates

CloudFormation can be written in JSON or YAML.

CloudFormation building blocks

  • Template components
    • Resources
    • Parameters – dynamic inputs for the template
    • Mappings – static variables
    • Outputs – references to what has been created
    • Conditions – list of conditions to perform on resource creation
    • Metadata
  • Template helpers
    • References
    • Functions – transform data in the template

Writing CloudFormation Templates

  • YAML supports
    • Key value pairs
    • Nested objects
    • Arrays
    • Multi line strings
    • Can include comments

CloudFormation Resources

Resources are mandatory in CloudFormation and represent the different AWS components that will be created.

Resources are declared and can reference each other.

AWS automatically creates, updates and deletes resources.

Resources cannot be dynamically created, they have to be declared within the template.

CloudFormation Parameters

Parameters are a way to provide value into the CloudFormation template.

They are useful if you want to reuse templates across the company and some inputs can not be determined ahead of time.

How to reference a parameter

The Ref function can be leveraged to reference parameters.

This allows parameters to be used anywhere in the template.

The shorthand for a reference in YAML is !Ref.

The function can also be used to reference other elements in the template.

Pseudo parameters can also be used to reference other data such as AWS::AccountId, AWS::StackId etc.

CloudFormation Mappings

Mappings are fixed variables within the CloudFormation template.

They’re useful to differentiate between environments.

All values are hardcoded within the template.

  • When to use mappings or parameters
    • Mappings are great when you know in advance all the values that can be taken
    • Parameters are required when the values are more user specific

To access mapping values, use the function Fn::FindInMap. It returns a value from a key.
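
A minimal sketch of a mapping lookup, written as a JSON-style template in Python. The map name, keys, instance types and the AMI placeholder are assumptions, and `find_in_map` only imitates locally what CloudFormation does when it resolves Fn::FindInMap.

```python
import json

TEMPLATE = {
    "Parameters": {
        "EnvType": {"Type": "String", "AllowedValues": ["dev", "prod"]}
    },
    "Mappings": {
        "EnvironmentSettings": {            # hypothetical map name
            "dev": {"InstanceType": "t2.micro"},
            "prod": {"InstanceType": "m5.large"},
        }
    },
    "Resources": {
        "Server": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-PLACEHOLDER",  # placeholder, not a real AMI
                "InstanceType": {
                    "Fn::FindInMap": [
                        "EnvironmentSettings", {"Ref": "EnvType"}, "InstanceType"
                    ]
                },
            },
        }
    },
}


def find_in_map(template, params, args):
    """Resolve [MapName, TopLevelKey, SecondLevelKey] against the template."""
    map_name, top_key, second_key = args
    if isinstance(top_key, dict) and "Ref" in top_key:
        top_key = params[top_key["Ref"]]  # look up the parameter value
    return template["Mappings"][map_name][top_key][second_key]


lookup = TEMPLATE["Resources"]["Server"]["Properties"]["InstanceType"]["Fn::FindInMap"]
print(find_in_map(TEMPLATE, {"EnvType": "prod"}, lookup))
print(json.dumps(TEMPLATE["Mappings"], indent=2))
```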

CloudFormation Outputs

The Outputs section declares optional output values that can then be imported into other stacks (templates).

This allows you to link templates.

Outputs can be viewed in the AWS console or in the AWS CLI.

Exported output names must be unique within your region.

It is useful for scenarios where you have a network CloudFormation stack and want to output values such as the VPC ID and subnet IDs.

Allows cross stack collaboration so experts can handle their own part of the stack.

Note that you won’t be able to delete a CloudFormation stack if the outputs are being referenced by another CloudFormation Stack.

An example would be creating a security group and exporting its name.

Then, using a cross stack reference, a second template can be created that uses that security group through the Fn::ImportValue function.
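
The two templates could look roughly like this, written as JSON-style Python dictionaries. The resource names, export name, CIDR range and AMI placeholder are all assumptions, and this sketch exports the security group ID rather than its name.

```python
import json

# Stack 1: creates the security group and exports its ID.
network_template = {
    "Resources": {
        "SSHSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Allow SSH",
                "SecurityGroupIngress": [{
                    "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                    "CidrIp": "203.0.113.0/24",  # documentation range
                }],
            },
        }
    },
    "Outputs": {
        "SSHSecurityGroupId": {
            "Value": {"Fn::GetAtt": ["SSHSecurityGroup", "GroupId"]},
            "Export": {"Name": "SSHSecurityGroupId"},  # unique per region
        }
    },
}

# Stack 2: imports the exported value by its export name.
app_template = {
    "Resources": {
        "AppServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-PLACEHOLDER",  # placeholder, not a real AMI
                "InstanceType": "t2.micro",
                "SecurityGroupIds": [{"Fn::ImportValue": "SSHSecurityGroupId"}],
            },
        }
    }
}

exported = {o["Export"]["Name"] for o in network_template["Outputs"].values()}
imported = {
    ref["Fn::ImportValue"]
    for ref in app_template["Resources"]["AppServer"]["Properties"]["SecurityGroupIds"]
}
print(json.dumps(app_template, indent=2))
```

While the second stack imports the value, the first stack cannot be deleted.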

CloudFormation Conditions

Conditions are used to control the creation of resources or outputs based on a condition.

E.g. depending on the environment (Dev, Test, Prod) or the region.

Each condition can reference another condition, parameter value or mapping.

For example, if the environment is prod then create the prod resources.

  • The intrinsic functions that can be used are:
    • Fn::And
    • Fn::Equals
    • Fn::If
    • Fn::Not
    • Fn::Or

Conditions can be applied to resources, outputs etc.
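
A hedged sketch of a condition attached to a resource. The parameter, condition and bucket names are made up, and the two helpers only mimic locally how an Fn::Equals condition decides which resources get created.

```python
TEMPLATE = {
    "Parameters": {
        "EnvType": {"Type": "String", "AllowedValues": ["dev", "prod"]}
    },
    "Conditions": {
        "CreateProdResources": {"Fn::Equals": [{"Ref": "EnvType"}, "prod"]}
    },
    "Resources": {
        "AppBucket": {"Type": "AWS::S3::Bucket"},
        "ProdBucket": {
            "Type": "AWS::S3::Bucket",
            "Condition": "CreateProdResources",  # only created in prod
        },
    },
}


def is_true(template, params, name):
    """Evaluate an Fn::Equals condition, resolving Ref against params."""
    left, right = template["Conditions"][name]["Fn::Equals"]

    def resolve(value):
        return params[value["Ref"]] if isinstance(value, dict) else value

    return resolve(left) == resolve(right)


def resources_to_create(template, params):
    """Names of resources whose condition (if any) holds."""
    return [
        name for name, resource in template["Resources"].items()
        if "Condition" not in resource
        or is_true(template, params, resource["Condition"])
    ]


print(resources_to_create(TEMPLATE, {"EnvType": "dev"}))
print(resources_to_create(TEMPLATE, {"EnvType": "prod"}))
```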

CloudFormation Intrinsic Functions

  • Ref
    • Used to reference parameters – returns the value of the parameter
    • Used to reference resources – returns a physical ID of the underlying resource
    • Shorthand is !Ref
  • Fn::GetAtt
    • Attributes are attached to any resources created
    • To get the attributes for these resources the Fn::GetAtt function can be used.
    • For example getting the AZ of an EC2 machine.
    • Shorthand is !GetAtt
  • Fn::FindInMap
    • Return named value from a specified key
    • !FindInMap [ MapName, TopLevelKey, SecondLevelKey ]
  • Fn::ImportValue
    • Imports values that are exported in other templates.
  • Fn::Join
    • Joins values with a delimiter
    • !Join [ delimiter, [ comma-delimited list of values ] ]
  • Fn::Sub
    • Substitute variables from text
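
To show what Fn::Join and Fn::Sub do to their inputs, here is a local imitation in Python. It is a simplification: the real Fn::Sub also resolves resource attributes and supports escaping with `${!Literal}`, which this sketch ignores, and the variable values are invented.

```python
import re


def fn_join(delimiter, values):
    """Imitates !Join [ delimiter, [ values ] ]."""
    return delimiter.join(values)


def fn_sub(text, variables):
    """Imitates !Sub by replacing each ${Name} with its value."""
    return re.sub(r"\$\{([\w:.]+)\}", lambda m: variables[m.group(1)], text)


print(fn_join(":", ["a", "b", "c"]))
print(fn_sub(
    "arn:aws:s3:::${BucketName}-${AWS::Region}",
    {"BucketName": "logs", "AWS::Region": "eu-west-1"},  # illustrative values
))
```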

CloudFormation Rollbacks

If a stack creation fails, by default everything gets rolled back (deleted) and this can be viewed in the logs.

There is an option to disable the rollback so that you can troubleshoot what happened.

If a stack update fails, it will automatically rollback to the previous known working state

Ability to see in the logs what happened in the error messages.

CloudFormation ChangeSets

ChangeSets tell you what changes are going to be made to a stack before you update it.

It won’t tell you if it will be successful, just what will happen.

CloudFormation Nested Stacks

Nested stacks are stacks as part of other stacks

They allow you to isolate repeated patterns / common components in separate stacks and call them from other stacks.

E.g. a load balancer configuration or security group that is re-used

Nested stacks are considered best practice.

To update a nested stack always update the parent (root) stack.

  • Difference between a Cross stack and a nested stack
    • Cross stacks are useful when stacks have different lifecycles. They use Outputs exports and the Fn::ImportValue function.
    • Cross stacks are also useful when you need to export values to many other stacks e.g. a VPC Id.
    • Nested stacks are useful when components must be re-used and recreated e.g. an ALB. This nested stack is only important to the higher level parent stack.


StackSets create, update or delete stacks across multiple accounts and regions with a single operation.

Administrator accounts create StackSets.

When you update a stack set, all associated stack instances are updated throughout all accounts and regions.

CloudFormation Drift

CloudFormation allows you to create infrastructure but it doesn’t protect against manual configuration changes, e.g. using the console. Such a change is called drift.

CloudFormation drift tells you whether a resource has drifted.

AWS Monitoring & Audit: CloudWatch, X-Ray and CloudTrail

  • CloudWatch
    • Metrics
    • Logs
    • Events – send events when certain events happen
    • Alarms – react in real time
  • X-Ray
    • Troubleshooting performance and errors
    • Distributed tracing of microservices
  • CloudTrail
    • Internal monitoring of API calls being made
    • Audit changes to AWS resources by users

CloudWatch Metrics

Provides metrics for every service.

A metric is a variable to monitor e.g. CPU Utilization.

Metrics belong to namespaces.

A dimension is an attribute of a metric (instance id, environment). You can have up to 30 dimensions per metric (older material quotes a limit of 10).

Metrics have timestamps.

Can create CloudWatch metrics dashboard for easy visualisation of the metrics.

  • CloudWatch has EC2 detailed monitoring
    • EC2 instance metrics are reported every 5 minutes by default
    • With detailed monitoring, data is available every 1 minute (at an extra cost)
    • This is useful for scaling an ASG, for example, because it can react much quicker.
  • Custom Metrics can be set
    • Define and set your own custom metrics in CloudWatch
    • Ability to use dimensions (attributes) to segment metrics
    • The resolution of the metrics can be set:
      • Standard = 1 minute.
      • High resolution = up to 1 second (StorageResolution API Parameter) at a higher cost
    • Use API call PutMetricData
    • Use exponential backoff in case of throttle errors.
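
A rough sketch of publishing a high resolution custom metric with exponential backoff. The `cloudwatch` argument is assumed to be a boto3 CloudWatch client (or anything with a compatible `put_metric_data` method), `is_throttle` is a caller-supplied check for throttling errors, and the namespace, metric name and dimension are invented for illustration.

```python
import random
import time


def put_metric_with_backoff(cloudwatch, is_throttle, max_attempts=5,
                            base_delay=0.1, sleep=time.sleep):
    """Call PutMetricData, retrying throttled calls with growing delays."""
    for attempt in range(max_attempts):
        try:
            return cloudwatch.put_metric_data(
                Namespace="MyApp",                      # hypothetical namespace
                MetricData=[{
                    "MetricName": "OrdersProcessed",    # hypothetical metric
                    "Dimensions": [{"Name": "Environment", "Value": "dev"}],
                    "Value": 1,
                    "Unit": "Count",
                    "StorageResolution": 1,             # 1 = high resolution
                }],
            )
        except Exception as error:
            if not is_throttle(error) or attempt == max_attempts - 1:
                raise
            # Wait a random time up to base_delay * 2^attempt (with jitter).
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The upper bound on the delay doubles on every retry, and the random jitter keeps many clients from retrying at the same moment.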

CloudWatch Alarms

Alarms trigger notifications for any metric.

They can be attached to ASG, EC2, SNS etc.

  • Alarm States
    • OK
    • INSUFFICIENT_DATA
    • ALARM
  • Evaluation Period
    • Length of time in seconds to evaluate the metric
    • High resolution custom metrics can only choose 10 seconds or 30 seconds.
    • e.g. if NetworkOut < 2000 for 5 minutes then perform alarm
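
The NetworkOut example can be sketched as a small evaluation function. This is a simplification that assumes every period in the window must breach the threshold and that any missing datapoint means INSUFFICIENT_DATA (a third state CloudWatch alarms can be in); real alarms have more options for both.

```python
def alarm_state(datapoints, threshold, periods):
    """Evaluate a 'less than threshold' alarm over the last `periods` points.

    `datapoints` is ordered oldest to newest; None marks a missing point.
    """
    recent = datapoints[-periods:]
    if len(recent) < periods or any(point is None for point in recent):
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(point < threshold for point in recent) else "OK"


# One datapoint per minute, NetworkOut < 2000 for 5 minutes.
print(alarm_state([5000, 1500, 1200, 900, 800, 700], 2000, 5))
```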

These alarms can be attached as policies to resources.

CloudWatch Logs

Applications can send logs to CloudWatch using the SDK.

  • CloudWatch can collect automatically from
    • Elastic beanstalk
    • ECS
    • Lambda
    • API Gateway
    • etc.

CloudWatch logs can be batch exported to S3 for archival or stream to ElasticSearch cluster for further analytics.

Logs can use filter expressions to search through them

  • Log storage architecture
    • Log groups – arbitrary name, usually representing an application
    • Log streams – instances within application / log files / containers

You can set log expiration policies (e.g. never expire, 30 days…) that can be defined at the Log Groups level.

By default CloudWatch logs never expire.

You can use the AWS CLI to watch logs from CloudWatch in the terminal.

To send logs to CloudWatch you need to have the correct IAM permissions.

Security: Logs can be encrypted using KMS at the group level.

CloudWatch Agent and CloudWatch Logs Agent

By default no logs from your EC2 instance will go to CloudWatch.

You need to run a CloudWatch agent on EC2 to push the log files.

The EC2 instance will need an IAM Role with the correct IAM permissions to allow it to send those logs.

The CloudWatch log agent can also be set up on-premises.

Difference between CloudWatch Logs Agent and the Unified Agent

Both for virtual servers (EC2 and on premise)

  • CloudWatch Logs Agent
    • Older version of the agent
    • Can only send logs to CloudWatch Logs
  • CloudWatch Unified Agent
    • Newer version of the agent
    • Collects logs to send to CloudWatch Logs
    • Collects additional system-level metrics such as RAM, processes etc. More granularity.
    • Centralized configuration using SSM Parameter Store

For detailed metrics in CloudWatch, use CloudWatch Unified Agent.

CloudWatch Logs Metric Filter

  • CloudWatch logs can use filter expressions such as
    • e.g. find a specific IP inside a log.
    • Or count occurrences of ‘ERROR’ in your logs
    • Metric filters can be used to trigger alarms

The filters will not retrospectively filter data. Only filters data for events after the filter was created.
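
A small simulation of the idea, not the CloudWatch API: it counts occurrences of a term in log events and ignores anything logged before the filter was created. The timestamps and messages are invented.

```python
def count_matches(events, term, filter_created_at):
    """Count events containing `term` that arrived after the filter existed.

    `events` is a list of (timestamp, message) tuples.
    """
    return sum(
        1 for timestamp, message in events
        if timestamp >= filter_created_at and term in message
    )


events = [
    (100, "ERROR database unreachable"),   # before the filter was created
    (205, "INFO request served"),
    (210, "ERROR timeout calling payment service"),
    (220, "ERROR timeout calling payment service"),
]
print(count_matches(events, "ERROR", filter_created_at=200))
```

The resulting count is the kind of metric value an alarm could then be attached to.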

CloudWatch Events

  • Schedule
    • CRON jobs
  • Event pattern
    • Event rules to react to a service doing something
    • e.g. CodePipeline state changes
    • Trigger Lambda, SQS, SNS etc.
  • CloudWatch events creates a small JSON document to give information about the change
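
An event pattern for failed pipelines could look like the dictionary below. The `matches` helper is a deliberately simplified local matcher (exact values only) meant to show how a pattern selects events; the sample event carries only a few illustrative fields and a made-up pipeline name.

```python
import json

# Pattern for a rule that reacts to failed CodePipeline executions.
PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}


def matches(pattern, event):
    """Return True if every field in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:   # a list of allowed values
            return False
    return True


event = {
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "detail": {"pipeline": "my-pipeline", "state": "FAILED"},  # made-up name
}
print(json.dumps(PATTERN))
print(matches(PATTERN, event))
```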


EventBridge is the next evolution of CloudWatch Events.

By default when using CloudWatch Events you are using the default event bus, which receives events generated by AWS services within your account.

EventBridge adds multiple buses.

  • Partner Event bus
    • Receive events from SaaS services or applications (ZenDesk, DataDog, Auth0 etc.)
    • Other parties can send events into your AWS account
  • Custom Event bus
    • Your application can publish its own events to your AWS account.
  • Event busses can be accessed by other AWS accounts.

Once the event buses are set up you then create rules on how to process the events (similar to CloudWatch Events).

EventBridge Schema Registry

EventBridge can analyze the events in the bus and infer the schema.

The schema registry allows you to generate code for your application, that will know in advance how data is structured in the event bus.

Schemas can be versioned.

  • Difference between EventBridge and CloudWatch Events
    • EventBridge builds upon and extends CloudWatch Events
    • Uses the same APIs and underlying service
    • The difference is that EventBridge adds event buses for your custom applications and your third party SaaS apps.
    • EventBridge has the schema registry capability
    • Different name to mark the new capabilities.


X-Ray provides a visual analysis of your application.

  • X-Ray advantages
    • Troubleshooting performance (bottlenecks)
    • Understand dependencies in a microservice architecture
    • Pinpoint service issues
    • Find errors and exceptions
    • Are we meeting time SLA
    • Where is the service being throttled
    • Identify users that are impacted.
  • X-Ray compatibility
    • AWS Lambda
    • Elastic Beanstalk – using .ebextensions config file
    • ECS
    • ELB
    • API Gateway
    • EC2 instances even on premise

X-Ray works by using Tracing. This is an end to end way of following a request.

Each component dealing with the request adds its own trace.

Tracing is made of segments (and sub segments).

Ability to trace every request or a sample of requests (e.g. a percentage, or a rate per minute).

X-Ray security uses IAM for authorization and encryption at rest.

  • How to enable X-Ray
    1. Your Code – import the AWS X-Ray SDK.
      • Very little code modification required
      • SDK will then capture calls to AWS Services, HTTP/HTTPS, Database calls (Dynamo etc)
    2. Install the X-Ray daemon or enable the X-Ray AWS Integration
      • X-Ray daemon works as a low level UDP packet interceptor (install on windows/linux/mac)
      • AWS Lambda and other AWS services already run the X-Ray daemon
      • Each application must have the IAM rights to write data to X-Ray

X-Ray Magic

X-Ray service collects data from all the different services and a service map is computed from all the segments and traces. It is graphical so even non-technical people can troubleshoot.

X-Ray Troubleshooting

  • If X-Ray is working on your local machine but not on your EC2 instance
    • Ensure the EC2 instance has the correct IAM role permissions
    • Ensure the EC2 instance is running the X-Ray daemon
  • To enable X-Ray on Lambda
    • Ensure it has the correct IAM execution role with the correct policy (AWSXrayWriteOnlyAccess)
    • Ensure X-Ray code is imported in the code and enabled.

X-Ray Instrumentation in your code

Instrumentation means to measure performance, diagnose errors and to write trace information.

To instrument code, use the X-Ray SDK.

You can modify application code to customize and annotate data that the SDK sends to X-Ray using interceptors, filters, handlers etc.

  • X-Ray Concepts
    • Segments – each application / service will send them (what is seen in the UI)
    • Subsegments – more detail in the segments
    • Trace – segments collected together to form an end-to-end trace
    • Sampling – decrease the number of requests sent to X-Ray to reduce cost
    • Annotations – Key Value pairs used to index traces and use with filters for searching
    • Metadata – Key Value pairs not indexed and not used for searching
  • The X-Ray daemon / agent has a config to send traces across accounts.
    • It must have the correct IAM permissions – the agent will assume the role
    • This can allow for a central account just for application tracing

X-Ray Sampling Rules

Sampling allows you to control the amount of data that you record.

Sampling rules can be modified without changing code.

By default the X-Ray SDK records the first request each second (the reservoir) and five percent of any additional requests (the rate).

The reservoir ensures that at least one trace is recorded each second.

The rate is the share of additional requests sampled beyond the reservoir size.

You can create your own rules with the reservoir and rate.

These configs can be set without having to restart or change anything in the application. The X-Ray daemon can perform the rule update automatically.
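
The reservoir and rate can be illustrated with a small simulation. This is not the X-Ray SDK, just a sketch of the rule described above; the class name and the injectable random source are my own choices.

```python
import random


class Sampler:
    """Samples the first `reservoir` requests each second, then a
    `rate` fraction of the remaining requests in that second."""

    def __init__(self, reservoir=1, rate=0.05, rng=random.random):
        self.reservoir, self.rate, self.rng = reservoir, rate, rng
        self.second, self.taken = None, 0

    def should_sample(self, now):
        second = int(now)
        if second != self.second:          # a new second: refill the reservoir
            self.second, self.taken = second, 0
        if self.taken < self.reservoir:
            self.taken += 1
            return True
        return self.rng() < self.rate      # beyond the reservoir: use the rate


sampler = Sampler(reservoir=1, rate=0.0)
decisions = [sampler.should_sample(t) for t in (10.0, 10.2, 10.9, 11.1)]
print(decisions)
```

With the rate set to zero only the reservoir applies, so exactly one request per second is traced; raising the rate adds a share of the remaining requests.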

X-Ray APIs

  • X-Ray Write API (used by the X-Ray daemon)
    • PutTraceSegments (upload segments to X-Ray)
    • PutTelemetryRecords (upload telemetry)
    • GetSamplingRules (retrieve sampling rules)
    • GetSamplingTargets and GetSamplingStatisticSummaries

The X-Ray daemon must have an IAM policy that authorizes it to perform these API calls.

  • X-Ray Read API
    • GetServiceGraph (main graph)
    • BatchGetTraces (retrieves list of traces specified by ID)
    • GetTraceSummaries (retrieve IDs and annotations for traces available for a specified time frame using an optional filter)
    • GetTraceGraph (retrieves a service graph for one or more specified trace)

X-Ray with Elastic Beanstalk

Beanstalk platforms include the X-Ray daemon.

The daemon can be run by setting an option in the Elastic Beanstalk console or with a configuration file (.ebextensions/xray-daemon.config).

Make sure the EC2 instance profile has the correct IAM permissions so that X-Ray daemon can function correctly.

Make sure the application code is instrumented with the X-Ray SDK.

The X-Ray daemon is not provided for Multicontainer Docker.

X-Ray and ECS

There are three ways of doing this.

  1. ECS Cluster: Use a container as a daemon on each EC2 instance
  2. ECS Cluster: Side Car pattern – run a container for every application container.
  3. Fargate Cluster: Use the Side Car Pattern

The container port of the X-Ray daemon needs to be mapped to port 2000 UDP. You also need to set the AWS_XRAY_DAEMON_ADDRESS environment variable and then link the two containers from a networking standpoint.


CloudTrail provides governance, compliance and audit for your AWS account.

It is enabled by default.

Allows you to get a history of events / API calls made within your AWS account by: Console, SDK, CLI, AWS Services.

These CloudTrail logs can be put into CloudWatch Logs or S3 if you want the logs for longer than 90 days.

A trail can be applied to All regions (default) or a single region.

Use case: If a resource is deleted in AWS, CloudTrail will tell you who did it.

CloudTrail Events

  1. Management Events
    • Operations that are performed on resources in your AWS account
    • e.g. Configuring security, setting up logging
    • By default trails are configured to log management events
    • Can separate Read Events (that don’t modify resources) from Write Events (that may modify resources e.g. try to delete a dynamoDB table)
  2. Data Events
    • By default, data events are not logged (high volume operations)
    • S3 object level activity (GetObject etc..). Can separate read and write events
    • Lambda function execution activity (the Invoke API)
  3. CloudTrail Insights Events
    • With lots of events, it can be hard to find useful events
    • CloudTrail Insights can detect unusual activity in your account.
    • e.g. Inaccurate resource provisioning, hitting service limits, bursts of IAM actions
    • CloudTrail Insights analyzes normal management events to create a baseline.
    • Then it continuously analyzes WRITE events to detect unusual patterns.
    • When anomalies are detected they will appear in the console, event is sent to S3, and EventBridge event is generated (for automation)

CloudTrail Events Retention

Events are stored for 90 days by default in CloudTrail.

To keep events beyond this period, log them to S3 and analyze using Athena.

Difference between CloudTrail, CloudWatch, X-Ray

  • CloudTrail (auditing)
    • Audit API calls made by users / services / console.
    • Useful to detect unauthorized calls or root cause changes.
  • CloudWatch (overall monitoring)
    • CloudWatch Metrics over time for monitoring
    • CloudWatch Logs for storing application logs
    • CloudWatch Alarms to send notifications in case of unexpected metrics
  • X-Ray (granular)
    • Automated trace analysis and Central service map visualisation.
    • Good for distributed services
    • Helpful for debugging
    • Latency, errors and fault analysis
    • Request tracking across distributed systems

AWS Messaging

Services need to communicate with each other.

  • There are two patterns for doing this
    1. Synchronous communication (application to application)
    2. Asynchronous / Event based (application to queue to application)

The problem with synchronous communication is that sudden spikes in traffic can overwhelm the receiving application.

  • The solution is to decouple applications
    • SQS – queue model
    • SNS – pub/sub model
    • Kinesis – real time streaming
  • Now these services can scale independently


Producers send messages into an SQS queue.

Consumers poll the queue for messages. The queue acts as a buffer.

SQS – Standard Queue

Fully managed service to decouple applications.

  • SQS attributes
    • Unlimited throughput
    • Unlimited number of messages in the queue
    • Default retention policy of messages is 4 days, maximum of 14 days.
    • Low latency (< 10 ms on publish and receive)
    • Limitation of 256KB per message sent

Can have duplicate messages (at least once delivery, occasionally more so your application needs to take this into account)

Can have out of order messages (best effort ordering)

  • Producing Messages
    • Produced to SQS using the SDK (SendMessage API)
    • This message is persisted in SQS until the consumer deletes it
    • Message retention – default 4 days, up to 14 days
  • Consuming Messages
    • Servers, EC2, Lambda etc.
    • Polls SQS for messages (receives up to 10 messages at a time)
    • Process the messages (e.g. insert the message into an RDS database)
    • Delete the messages using the DeleteMessage API
  • Multiple EC2 instance consumers
    • Consumers receive and process messages in parallel.
    • Each consumer will receive a different set of messages when they call the poll function
    • So if a message isn’t processed fast enough by a consumer, it may be received by another consumer (at-least-once delivery and best-effort message ordering).
    • Consumers then delete the messages after processing them

Consumers can be scaled horizontally to improve throughput of processing

SQS with ASG

Consumers can run on EC2 instances in an ASG.

SQS offers a CloudWatch metric for queue length (ApproximateNumberOfMessagesVisible), so a CloudWatch alarm can be set up to scale the ASG when the queue length exceeds a threshold.

SQS helps to decouple between application tiers.

SQS Security

  • Encryption
    • In-flight encryption using HTTPS API
    • At-rest encryption using KMS Keys
    • Client-Side encryption if the client wants to perform encryption/decryption itself
  • Access Controls
    • IAM policies to regulate access to the SQS API
  • SQS access policies (similar to S3 bucket policies)
    • Useful for cross-account access to SQS queues
    • Useful for allowing other services (SNS, S3) to write to an SQS queue.

SQS Access Policy

There are several reasons for needing access policies.

  • Cross account access (principal in policy is the account that needs access)
  • Publish S3 Event Notifications To SQS Queue (sourceArn is the bucket details)
Cross Account Access
S3 Event notification publish to SQS queue
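
The two access policy cases above can be sketched as plain policy documents. This is a minimal illustration built as Python dictionaries; the queue ARN, account ID and bucket ARN are hypothetical placeholders, and the actions you grant will depend on your own use case.

```python
import json

# Hypothetical identifiers, used only for illustration.
QUEUE_ARN = "arn:aws:sqs:us-east-1:444455556666:example-queue"
OTHER_ACCOUNT_ID = "111122223333"
BUCKET_ARN = "arn:aws:s3:::example-bucket"

# Cross-account access: the principal is the account that needs access.
cross_account_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{OTHER_ACCOUNT_ID}:root"},
        "Action": ["sqs:ReceiveMessage"],
        "Resource": QUEUE_ARN,
    }],
}

# S3 event notifications: the S3 service is the principal, and the
# aws:SourceArn condition restricts it to a single bucket.
s3_notification_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": QUEUE_ARN,
        "Condition": {"ArnLike": {"aws:SourceArn": BUCKET_ARN}},
    }],
}

print(json.dumps(s3_notification_policy, indent=2))
```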

SQS Message Visibility Timeout

When a message is polled by consumers it becomes invisible to other consumers.

The default message visibility timeout is 30 seconds. This means the message has to be processed within those 30 seconds.

After the message visibility timeout is over, the message will become visible in SQS again.

If a message is not processed within the visibility timeout, it may be received by another consumer and processed twice.

If a consumer knows it’s going to take longer than 30 seconds to process the message, there is an API ChangeMessageVisibility to get more time to process the message.

  • Message Visibility too high (hours)
    • If the consumer crashes, it will take a long time before the message becomes visible again.
  • Message Visibility too low (seconds)
    • Could get duplicate processing.

So a good balance for message visibility should be set for your application. If an application knows that it needs slightly longer to process a message it should call the ChangeMessageVisibility API to increase the time.
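
The visibility timeout behaviour can be illustrated with a toy in-memory model. This is only a sketch: SimulatedQueue and its methods are hypothetical names, time is passed in explicitly as `now`, and nothing here calls the real SQS API.

```python
class SimulatedQueue:
    """Toy in-memory model of a queue with a visibility timeout (not real SQS)."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> (body, time at which it is visible again)

    def send(self, message_id, body):
        self.messages[message_id] = (body, 0)

    def receive(self, now):
        """Return the first visible message and hide it for the timeout."""
        for message_id, (body, visible_at) in self.messages.items():
            if visible_at <= now:
                self.messages[message_id] = (body, now + self.visibility_timeout)
                return message_id, body
        return None

    def change_visibility(self, message_id, now, timeout):
        """Model of ChangeMessageVisibility: hide the message for longer."""
        body, _ = self.messages[message_id]
        self.messages[message_id] = (body, now + timeout)

    def delete(self, message_id):
        del self.messages[message_id]


queue = SimulatedQueue(visibility_timeout=30)
queue.send("m1", "resize image")

first = queue.receive(now=0)        # consumer A receives the message
hidden = queue.receive(now=10)      # still invisible to consumer B
second = queue.receive(now=31)      # A was too slow, so B receives a duplicate
queue.change_visibility("m1", now=35, timeout=120)  # B asks for more time
still_hidden = queue.receive(now=100)
queue.delete("m1")                  # B finishes and deletes the message
```

In this run the message is delivered twice because the first consumer neither finished nor extended the timeout within 30 seconds.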

SQS Dead Letter Queues

There may be scenarios where an application is unable to process a message, so it goes back in the queue. The application reads the message again but again it is unable to process it. And this cycle repeats. Maybe there is something wrong with the message.

The maximum receives threshold (maxReceiveCount in the queue’s redrive policy) can be set so that a message that keeps going back into the queue is moved into a Dead Letter Queue (DLQ) after it has been received X times.

You set up this additional queue in AWS and point your main queue to the DLQ.

DLQ is useful for debugging.

Make sure to process the messages in DLQ before they expire. Set the retention to 14 days in the DLQ.
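
The retry cycle and the move to a DLQ can be sketched as a small simulation. This is a simplified model, not the SQS implementation: drain_with_dlq and process are hypothetical helpers, and a ValueError stands in for any processing failure.

```python
from collections import deque


def drain_with_dlq(messages, process, max_receive_count=3):
    """Toy model: retry each message, parking it in a DLQ after too many receives."""
    queue = deque((message, 0) for message in messages)
    processed, dead_letter_queue = [], []
    while queue:
        message, receive_count = queue.popleft()
        receive_count += 1
        try:
            process(message)
            processed.append(message)               # success: the message is deleted
        except ValueError:
            if receive_count >= max_receive_count:
                dead_letter_queue.append(message)   # give up: keep it for debugging
            else:
                queue.append((message, receive_count))  # back into the main queue
    return processed, dead_letter_queue


def process(message):
    # Hypothetical processing step that only accepts numeric payloads.
    if not message.isdigit():
        raise ValueError(f"cannot parse {message!r}")


processed, dlq = drain_with_dlq(["1", "oops", "2"], process)
print(processed, dlq)
```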

SQS Delay Queues

Delays messages so consumers don’t see them immediately.

Delay of up to 15 minutes. The default is 0 seconds (messages are available right away).

  • Delays can be
    • Set as a default at queue level
    • Overridden per message on send using the DelaySeconds parameter.

SQS Long Polling

When a consumer requests messages from the queue and the queue is empty, it can wait for messages to arrive.

This is called Long Polling.

The reason to long poll is that it decreases the number of API calls made to SQS while increasing efficiency and reducing latency.

This reduces overall cost, since there are fewer API calls and less CPU usage.

Long polling wait time can be set for between 1 and 20 seconds.

Long polling is preferred over short polling.

Long polling can be enabled at the queue level (ReceiveMessageWaitTimeSeconds attribute) or at the API level using the WaitTimeSeconds parameter of ReceiveMessage.

SQS Extended Client

Message size limit is 256KB.

To increase message size use a Java library called SQS Extended Client.

It uses an S3 bucket as a repository for the large data and a metadata pointer in the queue.

A use case is if you are processing video files.

SQS Important API Calls

  • CreateQueue
    • Use MessageRetentionPeriod Parameter to set how long a message should stay in a queue before being discarded.
  • DeleteQueue
    • Delete queue and all messages
  • PurgeQueue
    • Delete all messages in a queue
  • SendMessage
    • Use DelaySeconds Parameter to send message with a delay.
  • ReceiveMessage
    • For polling
  • DeleteMessage
    • For once a message has been processed
  • MaxNumberOfMessages (ReceiveMessage parameter)
    • Number of messages received by consumers when polling (ReceiveMessage).
    • Default 1
    • Max 10
  • ReceiveMessageWaitTimeSeconds (queue attribute)
    • Long Polling
    • How long to wait before getting a response from the queue
  • ChangeMessageVisibility
    • Change the message visibility timeout if more time is required for processing.
  • Batch APIs – help to decrease your overall costs
    • SendMessageBatch
    • DeleteMessageBatch
    • ChangeMessageVisibilityBatch


First in first out – ordering messages in the queue.

Ordering guarantee.

However this comes with limited throughput: 300 msg/s without batching. 3000 msg/s with batching.

Exactly-once send capability (by removing duplicates)

Messages are processed in order by the consumer.

  • Deduplication
    • Deduplication interval is 5 minutes
    • If the same message is sent twice within 5 minutes, the second message will be refused
    • Two deduplication methods:
      1. Content-based deduplication: will do a SHA-256 hash of the message body
      2. Explicitly provide a Message Deduplication ID (MessageDeduplicationId)
  • Message Grouping
    • If the same value is specified for the MessageGroupId in an SQS FIFO queue there will only be one consumer and all the messages are in order.
    • To get ordering at the level of a subset of messages, specify different values for MessageGroupId
    • Messages that share a common Message Group ID will be in order in the group
    • Each Group ID can have a different consumer (parallel processing)
    • Ordering across groups is not guaranteed.
    • The use case is that if you need ordering for a subset of messages e.g. a consumer to handle only one user at a time, use grouping.
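
Deduplication and message grouping can be sketched with a toy model. This is an illustration under simplifying assumptions: SimulatedFifoQueue is a hypothetical class, time is passed in as `now` in seconds, and the real service's behaviour is richer than this.

```python
import hashlib
from collections import defaultdict

DEDUPLICATION_INTERVAL_SECONDS = 5 * 60


class SimulatedFifoQueue:
    """Toy model of FIFO deduplication and message groups (not the real service)."""

    def __init__(self):
        self.seen = {}                    # deduplication id -> time first accepted
        self.groups = defaultdict(list)   # message group id -> ordered bodies

    def send(self, body, group_id, now, deduplication_id=None):
        # Content-based deduplication: fall back to a SHA-256 hash of the body.
        if deduplication_id is None:
            deduplication_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first_seen = self.seen.get(deduplication_id)
        if first_seen is not None and now - first_seen < DEDUPLICATION_INTERVAL_SECONDS:
            return False  # duplicate inside the 5-minute window is refused
        self.seen[deduplication_id] = now
        self.groups[group_id].append(body)
        return True


fifo = SimulatedFifoQueue()
accepted = fifo.send("order-1 created", "user-a", now=0)
duplicate = fifo.send("order-1 created", "user-a", now=60)
other_group = fifo.send("order-2 created", "user-b", now=61)
same_group = fifo.send("order-1 paid", "user-a", now=62)
```

Each group keeps its own ordered list, so a separate consumer could work through "user-a" and "user-b" in parallel while ordering is preserved inside each group.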


SNS allows you to send one message but have many receivers.

The event producer only sends messages to one SNS topic.

There can be as many event receivers (subscriptions) as needed listening to the SNS topic notifications.

Each subscriber to the topic will get all the messages (there is a new feature to filter messages)

Up to 10 million subscriptions per topic.

Up to 100,000 topics.

  • Subscribers can be
    • SQS
    • HTTP / HTTPS (with delivery retries)
    • Lambda
    • Emails
    • SMS Messages
    • Mobile Notifications
  • Publishers can be
    • Many AWS services can integrate with SNS
    • CloudWatch alarms
    • ASG notifications
    • S3 (bucket events)
    • CloudFormation (e.g. failed to build stack)
    • etc..

SNS How To Publish

  • Topic Publish (using the SDK)
    • Create a topic
    • Create subscription(s)
    • Publish to the topic
  • Direct Publish (for mobile apps SDK)
    • Create a platform application
    • Create a platform endpoint
    • Publish to the platform endpoint
    • Works with Google GCM, Apple APNS, ADM

SNS Security

  • Encryption
    • In-flight encryption using HTTPS API.
    • At-Rest encryption using KMS keys
    • Client-side encryption if the client wants to perform encryption/decryption itself.
  • Access Controls
    • IAM policies to regulate access to the SNS API
  • SNS Access Policies (similar to S3 bucket policies)
    • Useful for cross account access to SNS topics
    • Useful for allowing other services (S3) to write to an SNS topic.

SNS + SQS Fan Out

The problem is that you want a message to be sent to multiple SQS queues.

Push the message once to SNS and receive it in all the SQS queues that are subscribers.

Fully decoupled model and no data loss.

  • SQS allows for
    • Data persistence
    • Delayed processing
    • Retries of work
    • Ability to add more SQS subscribers over time

For this to work, make sure the SQS queue access policy allows for SNS to write to the queues.

  • Use Cases
    • S3 Events to multiple queues
      • There is a limitation in S3. For the same combination of event type (e.g. object create) and prefix (e.g. images/), you can only have one S3 event rule.
      • If you want to send the same S3 event to many SQS queues use fan out.
S3 Fan Out


SNS has FIFO (same as SQS)

  • Similar features as SQS
    • Ordering by Message Group ID (all messages in the same group are ordered).
    • Deduplication using a Deduplication ID or Content-Based Deduplication.

If using a queue as a subscriber to an SNS FIFO topic, it has to be an SQS FIFO queue

Limited throughput (same limit as an SQS FIFO queue)

Use Case – you want fan out with ordering and deduplication

SNS Message Filtering

A JSON policy is used to filter the messages sent to an SNS topic’s subscriptions.

If a subscription doesn’t have a filter policy, it receives every message.
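
A filter policy can be thought of as a set of allowed values per message attribute. The sketch below only models exact string matching; matches_filter_policy, the attribute names and the values are hypothetical, and real filter policies support more operators than shown here.

```python
def matches_filter_policy(filter_policy, message_attributes):
    """Return True if every attribute named in the policy has an allowed value."""
    if not filter_policy:
        return True  # no filter policy: the subscription receives every message
    return all(
        message_attributes.get(name) in allowed_values
        for name, allowed_values in filter_policy.items()
    )


# Hypothetical subscription that only wants rejected orders.
rejected_only = {"state": ["rejected"]}

print(matches_filter_policy(rejected_only, {"state": "rejected"}))
print(matches_filter_policy(rejected_only, {"state": "placed"}))
```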


Allows you to collect, process and analyze streaming data in real time. e.g. Application logs, metrics, telemetry data.

  • Kinesis types
    • Data Streams – capture, process and store data streams
    • Data Firehose – load data streams into AWS data stores
    • Data Analytics – analyze data streams with SQL or Apache Flink
    • Video Streams – Capture, process and store video streams.

Kinesis Data Streams

Stream big data in your systems.

Streams are made up of shards that are provisioned ahead of time. The number of shards is set beforehand.

Producers send data into Kinesis. This data is called records.

Records are made up of a Partition Key (determines which shard the record goes to) and a Data Blob (up to 1 MB).

The rate of records that can be sent to Kinesis is 1MB / sec or 1000 msg / sec PER SHARD (e.g. if you have 6 shards you can send 6MB / second).

Consumers can be applications on EC2, Lambda, Kinesis Data Firehose or Kinesis Data Analytics etc.

The rate of consumption is 2MB / sec (shared) per shard for all consumers. OR 2MB / sec (enhanced) per shard per consumer.

  • Kinesis Billing
    • Billing is per shard provisioned (regardless of the throughput)
    • Can have as many shards as you want.
  • Data Retention
    • Default 1 day, max 365 days.
    • Ability to reprocess (replay) data.
  • Immutability
    • Once data is inserted into kinesis, it can’t be deleted.
  • Ordering
    • Data that shares the same partition key goes to the same shard (ordering at the partition key level)
  • Producers
    • AWS SDK, KPL (Kinesis Producer Library), Kinesis Agent
  • Consumers
    • Write your own using the KCL (Kinesis Client Library), SDK
    • Managed consumers: Lambda, Firehose etc.

Kinesis Data Streams Security

It is deployed in a region.

Access / authorization is controlled using IAM policies.

Encryption in flight using HTTPS endpoints.

Encryption at rest using KMS

Client side encryption can be implemented but you have to implement this yourself.

VPC endpoints are available for Kinesis to access within the VPC.

All API calls can be monitored using CloudTrail.

Kinesis Producers

Puts data records into data streams.

  • Data records consist of
    • Sequence number (unique per partition key within shard)
    • Partition key (must specify when putting records into streams)
    • Data blob (up to 1 MB)
  • Producers can be:
    • SDK
    • KPL (Kinesis producer Library)
    • Kinesis Agent (monitor log files)

Write Throughput – 1MB / sec or 1000 records / sec per shard.

  • PutRecord API to send a record into Kinesis.
  • PutRecords API for batching to reduce costs and increase throughput.

It is important to have a highly distributed partition key to avoid a “hot partition”, where traffic is not evenly distributed across shards. e.g. use a device ID, which has many distinct values, rather than a phone brand, which has only a few.

The Data Record partition key is hashed then sent to Kinesis and to a specific shard within Kinesis.
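
The effect of the partition key on shard distribution can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit hash key; the assumption in this sketch is that the shards split that hash space evenly, and shard_for is a hypothetical helper rather than an SDK call.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# Few distinct keys (phone brand) versus many distinct keys (device ID).
brands = ["apple", "samsung"] * 500
devices = [f"device-{n}" for n in range(1000)]

brand_load = Counter(shard_for(key, 4) for key in brands)
device_load = Counter(shard_for(key, 4) for key in devices)
print("brand keys:", dict(brand_load))
print("device keys:", dict(device_load))
```

With only two distinct keys, at most two of the four shards ever receive traffic, while the device IDs spread across all of them.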

  • Kinesis ProvisionedThroughputExceeded Exception
    • Occurs when throughput limits are exceeded.
    • SOLUTION: Use a highly distributed partition key.
    • SOLUTION: Implement retries with exponential backoff.
    • SOLUTION: Increase shards (shard splitting) (scaling)
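
Retries with exponential backoff can be sketched like this. The ThroughputExceeded exception and the put_record callable are stand-ins for whatever your client raises and calls; the delays and attempt count are illustrative assumptions.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Stand-in for a ProvisionedThroughputExceeded error."""


def put_with_backoff(put_record, record, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call put_record, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return put_record(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait up to base_delay * 2^attempt seconds, with jitter.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


attempts = []


def flaky_put(record):
    # Hypothetical producer call that is throttled twice before succeeding.
    attempts.append(record)
    if len(attempts) < 3:
        raise ThroughputExceeded()
    return "ok"


result = put_with_backoff(flaky_put, {"key": "device-1"}, sleep=lambda seconds: None)
```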

Kinesis Consumers

Get data records from the stream and process them

  • Consumers can be
    • Lambda functions
    • Data Analytics
    • Data Firehose
    • Custom consumer using the SDK (classic or enhanced fan out)
    • Kinesis client library (KCL) – library to simplify reading from data streams.
  • Difference between a Classic Shared Fan Out Consumer and an Enhanced Consumer:
    • Classic fan out consumer has a maximum of 2MB/sec per shard across all consumers (so with 4 consumers the throughput is shared between them, 500KB/sec each).
      • Uses the GetRecords() API call. 5 calls/second
      • Latency ~200 ms
      • Pull model
      • This is useful when there are a low number of consuming applications
      • Helps to minimise cost.
      • Consumers poll data from kinesis using the GetRecords API call
      • Returns up to 10 MB then throttles for 5 seconds. Or up to 10000 records.
    • Enhanced fan out consumer has a throughput of 2MB per consumer per shard.
      • Uses the SubscribeToShard() API call
      • Shards will push to consumers application at 2MB per second.
      • Latency ~70 ms since the data is being pushed into a consumer.
      • Push model
      • Data pushed to consumers using a streaming method of HTTP/2.
      • Multiple consuming applications from the same stream
      • Higher cost
      • Soft limit of 5 consumer applications (KCL) per data stream but this can be increased by placing a support ticket.
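
The throughput difference between the two consumer types comes down to simple arithmetic, sketched below. per_consumer_throughput is a hypothetical helper that uses the 2MB/sec per shard figure from the list above.

```python
SHARD_READ_MB_PER_SECOND = 2


def per_consumer_throughput(shards, consumers, enhanced_fan_out):
    """Read throughput in MB/sec available to each consumer application."""
    total = shards * SHARD_READ_MB_PER_SECOND
    if enhanced_fan_out:
        return total              # each consumer gets the full rate per shard
    return total / consumers      # classic: the rate is shared by all consumers


print(per_consumer_throughput(shards=1, consumers=4, enhanced_fan_out=False))
print(per_consumer_throughput(shards=1, consumers=4, enhanced_fan_out=True))
```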

Kinesis Consumers – AWS Lambda

Consumer without managing servers.

Supports both Classic and Enhanced fan out consumers.

Reads records in batches. Batch Size and Batch Window can be configured.

If errors occur the Lambda retries until it succeeds or the data is expired.

Can process up to 10 batches per shard simultaneously.

Kinesis Client Library (KCL)

Java library that helps you read records from the Kinesis data stream with distributed applications sharing the read workload.

Each shard is to be read by only one KCL instance. e.g. 4 shards = max 4 KCL instances

Can’t have more KCL apps than shards, because those extra apps will be doing nothing.

The KCL client will read from the Kinesis data stream, and its reading progress will be checkpointed into DynamoDB (needs IAM access to DynamoDB).

It will be able to track other workers and share work amongst shards using DynamoDB. So if an application goes down, another application can continue where it left off by getting their checkpoint from DynamoDB.

KCL can run on anything e.g. EC2, Beanstalk but must have the correct IAM roles.

Records are read in order at the shard level.

  • There are two versions of the KCL library
    • KCL v1 (supports only shared consumer)
    • KCL v2 (supports shared and enhanced fan-out consumer)
4 shards, 2 applications
4 shards, 4 applications
6 shards, 4 applications

KCL will automatically detect the increase in shards and manage the load appropriately amongst the applications.
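
The shard-to-application scenarios above can be sketched with a simple round-robin assignment. This is an assumption made for illustration: the real KCL coordinates workers through leases stored in DynamoDB, and assign_shards is a hypothetical helper.

```python
def assign_shards(shard_ids, worker_ids):
    """Spread shards over workers round-robin; each shard gets exactly one reader."""
    assignment = {worker: [] for worker in worker_ids}
    for index, shard in enumerate(shard_ids):
        assignment[worker_ids[index % len(worker_ids)]].append(shard)
    return assignment


four_shards = ["shard-1", "shard-2", "shard-3", "shard-4"]
six_shards = four_shards + ["shard-5", "shard-6"]

two_apps = assign_shards(four_shards, ["app-1", "app-2"])
four_apps = assign_shards(four_shards, ["app-1", "app-2", "app-3", "app-4"])
after_split = assign_shards(six_shards, ["app-1", "app-2", "app-3", "app-4"])
too_many_apps = assign_shards(["shard-1", "shard-2"], ["app-1", "app-2", "app-3"])
```

In the last case one application ends up with an empty list, which is why running more KCL applications than shards brings no benefit.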

Kinesis Operations – Shard Splitting

Used to split a shard into 2 shards – more shards.

Used to increase stream capacity. (get an extra 1MB/s data in per shard)

Used to divide a “hot shard”

This increases capacity, but also increases cost.

For example, splitting one shard in a 3-shard stream increases write throughput from 3 MB/s to 4 MB/s.

The old shard will be closed and will be deleted once the data has expired (depends on the data retention period).

There is no autoscaling in Kinesis Data Streams, but you can manually increase / decrease capacity.

A shard can’t be split into more than 2 shards in a single operation. e.g. to go from 1 shard to 3 shards you would have to perform multiple splits.

Kinesis Operations – Shard Merging

Decrease the stream capacity and save costs.

Can be used to group two shards with low capacity. (cold shards).

The old shard will be closed and will be deleted once the data has expired (depends on the data retention period).

Can’t merge more than two shards in a single operation.
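
Splitting and merging can be pictured as operations on hash key ranges. In this sketch a shard is represented as a (start, end) tuple; splitting at the midpoint is an assumption made for simplicity, and split_shard and merge_shards are hypothetical helpers rather than SDK calls.

```python
def split_shard(shard):
    """Split one hash key range into two adjacent child ranges."""
    start, end = shard
    middle = (start + end) // 2
    return (start, middle), (middle + 1, end)


def merge_shards(left, right):
    """Merge two adjacent hash key ranges back into one."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent shards can be merged")
    return (left[0], right[1])


whole = (0, 2 ** 128 - 1)
first, second = split_shard(whole)          # 1 shard -> 2 shards
first_a, first_b = split_shard(first)       # a second split gives 3 shards
three_shards = [first_a, first_b, second]
merged_back = merge_shards(first_a, first_b)
```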

Kinesis Data Firehose

Firehose takes data from multiple streams, optionally processes the data using Lambdas and then batch writes the data to destinations.

  • The destinations are
    • AWS S3
    • AWS Redshift (amazon data warehouse but it first writes the data through s3). (COPY command from S3 to redshift)
    • AWS ElasticSearch
    • 3rd party applications
    • Custom HTTP endpoint

All the data can also be sent to a backup S3 bucket or all the failed write data can be sent to an S3 bucket.

Fully managed service.

You only pay for data going through firehose.

  • Near realtime
    • 60 second latency minimum for non full batches
    • Or wait for 32MB of data at a time
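
The near real time behaviour comes from buffering: a batch is written when it is full or when it has waited long enough. The sketch below is a toy model with hypothetical names (BufferedDelivery, put, flush_if_due); the default limits mirror the 32MB and 60 second figures above, and time is passed in as `now`.

```python
class BufferedDelivery:
    """Toy model of size- or time-based batching (not the real Firehose service)."""

    def __init__(self, max_bytes=32 * 1024 * 1024, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.opened_at = None          # time the first record of the batch arrived
        self.delivered_batches = []

    def put(self, record, now):
        if self.opened_at is None:
            self.opened_at = now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        self.flush_if_due(now)

    def flush_if_due(self, now):
        if not self.buffer:
            return
        full = self.buffered_bytes >= self.max_bytes
        expired = now - self.opened_at >= self.max_seconds
        if full or expired:
            self.delivered_batches.append(self.buffer)
            self.buffer, self.buffered_bytes, self.opened_at = [], 0, None


delivery = BufferedDelivery(max_bytes=10, max_seconds=60)  # tiny limit for the demo
delivery.put(b"aaaa", now=0)
delivery.put(b"bbbbbbb", now=1)     # 11 bytes buffered: the batch is full
delivery.put(b"c", now=5)
delivery.flush_if_due(now=65)       # 60 seconds have passed: flush a partial batch
```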

Supports many data formats, conversions, transformations and compressions

Supports custom data transformations using Lambda.

Can send failed or all the data to a backup S3 bucket.

Difference between Kinesis Data Streams vs Firehose

  • Data Streams
    • Streaming service to ingest at large scale
    • Write custom code (producers and consumers)
    • Real time (~200 ms)
    • Manage scaling through shard splitting and merging. No autoscaling.
    • Data storage for 1 to 365 days.
    • Supports replay capability
    • Pay for how much capacity you have provisioned
  • Firehose
    • Ingestion service to stream data to specific services
    • Fully managed
    • Near real time (buffer time min 60 seconds)
    • Automatic scaling
    • No data storage (can’t replay data)
    • Pay for only the data going through Firehose.

Kinesis Data Analytics (SQL)

Data Analytics takes data from sources such as Data Streams or Firehose.

Then you can write streaming SQL statements to analyze the data in real time.

Once the data has been analyzed it streams the output. The stream can go into Data Sinks.

  • Data Sinks can be:
    • Data Streams – then to lambda or other consuming applications.
    • Firehose – then onto firehose destinations.

Perform data analytics on Kinesis Streams using SQL

Fully managed service

Automatic scaling. Only pay for what goes through the application.

Real-time analytics – pay for actual consumption rate.

Can create streams from the real time queries.

  • Use cases
    • Time series analytics
    • Real time dashboards
    • Real time metrics

SQS vs SNS vs Kinesis

Data ordering for Kinesis vs SQS FIFO

Depending on use case, a choice will have to be made on whether to use Kinesis or SQS FIFO.

Using SQS FIFO you get a dynamic number of consumers based on the Message Group IDs e.g. 100 group IDs means you can have 100 consumers.

Or it may be better to use Kinesis Data stream if you have 10,000 trucks and a very large amount of data to consume and also need data ordering per shard. Use partition key to send the same key to the same shard to provide ordering.


  • Serverless services in AWS
    • Lambda
    • DynamoDB
    • Cognito
    • API Gateway
    • S3
    • SNS and SQS
    • Firehose
    • Aurora Serverless
    • Step functions
    • Fargate
  • Downsides of EC2 instances
    • Virtual servers in the cloud
    • Limited by RAM and CPU
    • Continuously running
    • Scaling means intervention to add / remove servers
  • Lambda
    • Virtual functions
    • Limited by time (short executions)
    • Run on-demand
    • Scaling is automated
  • Benefits of Lambda
    • Easy pricing
      • Pay per request and compute time.
      • Free tier of 1,000,000 Lambda requests and 400,000 GB-seconds of compute time
    • Integrates well with other AWS services
    • Easy monitoring through CloudWatch
    • Easy to get more resources per function (up to 10 GB RAM)
    • Increasing RAM will also improve the performance of the CPU and network.

Lambda supports all major languages such as Node, Python, Java but it also has a Custom Runtime API to allow for other languages.

Lambda Container Image – The container image must implement the Lambda Runtime API. ECS/Fargate is still preferred for running arbitrary Docker images

  • Lambda Pricing
    • Pay per calls
      • First X number of calls are free
      • Then pay per million requests afterwards
    • Pay per duration (in 1ms increments)
      • X GB-seconds of compute time free per month
      • Then pay per GB-second
    • Lambda is cheap
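
The pricing model can be turned into a rough estimate. In this sketch the free tier figures come from the notes above, while the two prices passed in are hypothetical placeholders, not AWS's actual rates; check the current pricing page before relying on any number.

```python
FREE_REQUESTS = 1_000_000
FREE_GB_SECONDS = 400_000


def monthly_cost(requests, avg_duration_ms, memory_mb,
                 price_per_million_requests, price_per_gb_second):
    """Estimate a monthly bill from request count, duration and memory size."""
    # GB-seconds = invocations x duration in seconds x memory in GB.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, requests - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    return (billable_requests / 1_000_000 * price_per_million_requests
            + billable_gb_seconds * price_per_gb_second)


# Placeholder prices for illustration only.
cost = monthly_cost(
    requests=3_000_000,
    avg_duration_ms=200,
    memory_mb=512,
    price_per_million_requests=0.20,
    price_per_gb_second=0.00002,
)
print(round(cost, 2))
```

Here the compute stays inside the free allowance (about 300,000 GB-seconds), so only the 2 million requests above the free tier are charged.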

Lambda Synchronous Invocations

Synchronous invocation occurs when using the CLI, SDK, API Gateway, Application Load Balancer.

Waiting for the result.

Error handling must happen on the client side (retries, exponential backoff etc.)

  • Services that can synchronously invoke lambda functions
    • User Invoked
      • ELB (ALB)
      • API Gateway
      • CloudFront
      • S3 Batch
    • Service Invoked
      • Cognito
      • Step Functions
    • Other services
      • Lex
      • Alexa
      • Kinesis Data Firehose

Lambda and Application Load Balancer

There are two ways of exposing a lambda function directly to the public internet

  1. Application Load Balancer
    • HTTP gets transformed into JSON
    • ALB converts the JSON response back to HTTP
  2. API Gateway

When using an ALB, the lambda function must be registered in a target group.

Request Payload for Application Load Balancer to Lambda
  • Application Load Balancer (ALB) multi-value headers
    • When this is enabled, multi-value headers and query string parameters that are sent with multiple values are shown as arrays within the Lambda event and response objects.
    • This is a setting that must be enabled.
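
A function behind an ALB with multi-value headers enabled could look roughly like the sketch below. The `name` query parameter and the sample event are hypothetical, and the sample event only contains the fields this handler reads.

```python
def handler(event, context):
    """Return a greeting for each `name` query parameter sent through an ALB."""
    params = event.get("multiValueQueryStringParameters") or {}
    names = params.get("name", ["world"])
    return {
        "statusCode": 200,
        "statusDescription": "200 OK",
        "isBase64Encoded": False,
        # With multi-value headers enabled, the response uses arrays as well.
        "multiValueHeaders": {"Content-Type": ["text/plain"]},
        "body": "\n".join(f"hello {name}" for name in names),
    }


# Trimmed-down sample of what ?name=ana&name=ben would look like.
sample_event = {
    "httpMethod": "GET",
    "path": "/hello",
    "multiValueQueryStringParameters": {"name": ["ana", "ben"]},
    "multiValueHeaders": {"accept": ["text/plain"]},
}
response = handler(sample_event, None)
```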


Another type of synchronous invocation of lambda.

Lambda@Edge are lambdas deployed with your CloudFront CDN in whichever regions your Lambda is in.

This is useful if you want to run a global lambda alongside your CloudFront distributions or implement request filtering before reaching your application.

  • Benefits
    • Build more responsive applications
    • No servers to manage
    • Lambdas deployed globally
    • Customize what goes through your CDN
    • Only pay for what you use.
  • Lambda@Edge capabilities
    • Change CloudFront requests and responses
      1. Viewer Request – After CloudFront receives a request from a viewer.
      2. Origin Request – Before CloudFront forwards the request to the origin.
      3. Origin Response – After CloudFront receives the response from the origin.
      4. Viewer Response – Before CloudFront forwards the response to the viewer.

This also allows you to generate responses to the viewers without ever sending the request to the origin using Viewer Request and Viewer Response Lambda options.
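
A viewer request function that answers at the edge could be sketched as follows. The /old/ to /new/ redirect rule is a hypothetical example, and the sample event is trimmed to the fields the handler reads.

```python
def handler(event, context):
    """Redirect legacy paths at the edge, otherwise pass the request through."""
    request = event["Records"][0]["cf"]["request"]
    if request["uri"].startswith("/old/"):
        # Generate the response here; the origin is never contacted.
        new_uri = request["uri"].replace("/old/", "/new/", 1)
        return {
            "status": "301",
            "statusDescription": "Moved Permanently",
            "headers": {"location": [{"key": "Location", "value": new_uri}]},
        }
    return request  # let CloudFront continue processing the request


sample_event = {
    "Records": [{"cf": {"request": {"uri": "/old/page.html", "method": "GET", "headers": {}}}}]
}
redirect = handler(sample_event, None)
```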

This allows you to build Global Applications:

  • Lambda@Edge Use Cases
    • Website Security and Privacy
    • Dynamic Web Application at the Edge
    • SEO
    • Intelligently route across origins and data centres
    • Bot mitigation at the edge
    • Real time image transformation
    • A/B testing
    • User authentication and authorization prior to reaching the origin
    • User prioritization
    • User tracking and analytics

Lambda Asynchronous Invocations and DLQ

Triggered by S3, SNS, CloudWatch Events etc.

  • Example
    • Events are placed in a queue
    • Lambda attempts to read
    • It retries on errors
      • 3 tries in total
      • 1 minute after first attempt
      • Then 2 minutes wait
    • Make sure the processing is idempotent (same value returned every time) (in case of retries)
    • If the function is retried you will see duplicate log entries in CloudWatch
    • Can define a DLQ (dead letter queue)
      • SNS
      • SQS
      • For failed processing
      • Will require the correct IAM permissions.

Asynchronous invocations allow you to speed up the processing if you don’t need to wait for the result as you can start multiple parallel processes in the background.

  • Services that can asynchronously invoke lambda functions
    • S3 (S3 bucket events)
    • SNS
    • CloudWatch Events / EventBridge
    • CodeCommit
    • CodePipeline
    • CloudWatch Logs
    • SES
    • CloudFormation
    • Config
    • IOT
    • IOT Events

Lambda and CloudWatch Events / EventBridge

  • CRON jobs or Rate can trigger lambdas
  • CodePipeline EventBridge Rule

Lambda and S3 Event Notifications

  • S3 Notifications for
    • ObjectCreated
    • ObjectRemoved
    • ObjectRestore
    • etc.

Use case is generating thumbnails. Image is uploaded to S3 then Event notification triggers a lambda.

S3 Event notifications typically deliver events in seconds, but sometimes take a minute or longer.

If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent.

To avoid this, ensure that versioning is enabled on the bucket to ensure every event is sent.
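
A function triggered by S3 event notifications usually starts by pulling the bucket and key out of each record, as in this sketch. The sample event is trimmed to the fields used, the bucket and key are made up, and the thumbnail generation itself is left out.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the (bucket, key) pairs of newly created objects."""
    uploaded = []
    for record in event["Records"]:
        if record["eventName"].startswith("ObjectCreated"):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in the notification.
            key = unquote_plus(record["s3"]["object"]["key"])
            uploaded.append((bucket, key))
    return uploaded


sample_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "images/my+cat.png"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "images/old.png"}}},
    ]
}
uploaded = handler(sample_event, None)
```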

Lambda Event Source Mapping

  • Applies to
    • Kinesis Data Streams
    • SQS and SQS FIFO queues
    • DynamoDB Streams

What these services have in common is that Lambda has to poll them to get data (records are polled from the source).

In these cases the Lambda is invoked synchronously.

The way it works is a Lambda will have an internal Event Source Mapper that polls streams for records. The polling returns a batch of data and the lambda is invoked synchronously with event batch.

There are two categories of event source mapper:

  1. Streams
  2. Queues

Lambda Event Source Mapping – Streams

This applies to Kinesis Streams and DynamoDB Streams.

An event source mapping creates an iterator for each shard that processes items in order.

It can begin reading with new items, from the beginning, or from a timestamp.

Processed items are not removed from the stream (other consumers can read them).

  • Use case
    • Low Traffic Stream – use batch window to accumulate records before processing
    • High traffic Stream – have lambdas process multiple batches in parallel.
  • Up to 10 batch processors per shard
  • Each batch will have in order processing for each partition key.
  • Streams and Lambdas Error Handling
    • If the function returns an error, the entire batch is reprocessed until the function succeeds or the items in the batch expire.
    • To ensure in-order processing, processing of the affected shard is paused until the error is resolved.
    • The event source mapping can be configured to:
      • Discard old events – discarded events can go to a destination.
      • Restrict the number of retries
      • Split the batch on error (to work around the lambda timeout issue so at least the lambda can process some of the batch).

Lambda Event Source Mapping – Queues

Applies to SQS and SQS FIFO

SQS is polled by a lambda event source mapping. Lambda is invoked synchronously with the event batch

  • The way it works
    • Event Source Mapping will poll SQS using long polling
    • Batch size can be specified (1 – 10 messages)
    • It is recommended to set the queue visibility timeout to 6x the timeout of your lambda function.

To use a DLQ, set it up on the SQS queue, NOT on the lambda. DLQ for lambda is only for async invocations. Or use a lambda destination for failures.

  • Lambda supports in-order processing for FIFO queues
    • Scales up to the number of active message groups (Group ID setting)
  • For standard queues items aren’t guaranteed to be processed in order.
    • Scales up to process a standard queue as quickly as possible

When an error occurs, batches are returned to the queue as individual items and therefore may be processed in different grouping to the original batch.

Occasionally the event source mapping might receive the same item from the queue twice, even if no function error occurred. Ensure your function’s processing is idempotent.

Lambda deletes items from the queue once they have been successfully processed

The source queue can be configured to send items to a dead letter queue if they can’t be processed.
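
Because the same message can arrive twice, a queue-triggered function should be idempotent. The sketch below skips message IDs it has already handled; the in-memory set is only a stand-in for a durable store, and the orderId field in the body is a hypothetical payload.

```python
import json

processed_ids = set()  # stand-in for a durable store such as a database table


def handler(event, context):
    """Process each SQS record once, skipping IDs that were already handled."""
    handled = []
    for record in event["Records"]:
        message_id = record["messageId"]
        if message_id in processed_ids:
            continue  # duplicate delivery: do nothing
        order = json.loads(record["body"])
        handled.append(order["orderId"])
        processed_ids.add(message_id)
    return handled


sample_event = {
    "Records": [
        {"messageId": "id-1", "body": json.dumps({"orderId": 42})},
        {"messageId": "id-1", "body": json.dumps({"orderId": 42})},
        {"messageId": "id-2", "body": json.dumps({"orderId": 43})},
    ]
}
first_run = handler(sample_event, None)
second_run = handler(sample_event, None)
```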

Lambda Event Mapper Scaling

  • Kinesis Data Streams and DynamoDB Streams
    • One lambda invocation per stream shard
    • With parallelization, up to 10 batches processed per shard simultaneously.
  • SQS Standard
    • Lambda adds 60 more instances per minute to scale up
    • Up to 1000 batches of messages processed simultaneously
  • SQS FIFO
    • Messages with the same GroupID will be processed in order
    • Lambda function scales to the number of active message groups

Lambda Destinations

For asynchronous invocations you can define the destination for both successful and failed events.

  • These destinations can be
    • SQS
    • SNS
    • Lambda
    • EventBridge

It is recommended to use destinations and not DLQ.

For Event Source Mapping the destination is only for discarded (failed) event batches.

  • These destinations can be
    • SQS
    • SNS

You can send events to a DLQ directly from SQS.

Lambda Permissions – IAM Roles and Resource Policies

Lambda Execution Role – An IAM role must be attached to the lambda function to grant the Lambda permissions to AWS services and resources.

  • Sample managed policies
    • AWSLambdaBasicExecutionRole – Upload logs to CloudWatch
    • AWSLambdaKinesisExecutionRole – Read from Kinesis
    • AWSLambdaDynamoDBExecutionRole – Read from DynamoDB Streams
    • AWSLambdaSQSQueueExecutionRole – Read from SQS
    • AWSLambdaVPCAccessExecutionRole – Deploy Lambda function in VPC
    • AWSXRayDaemonWriteAccess – Upload trace data to X-Ray

When using event source mapping to invoke the lambda function, lambda is reading the event data and therefore needs the correct IAM role to do so.

The best practice is to create one lambda execution role per function.

Lambda Resource Based Policies – If the lambda is invoked by other services.

Resource-based policies give other accounts and AWS services permission to use Lambda resources.

  • An IAM principal can access lambda:
    • if the IAM policy attached to the principal authorizes it (e.g. user access)
    • Or if there is a resource based policy that authorizes it (e.g. service access)

For example when a service like S3 calls lambda, the resource based policy gives it access to do so.
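
To make the S3 example concrete, here is a minimal sketch of what such a resource based policy statement could look like. The helper name and all ARNs and account IDs are invented for illustration. In practice the statement is usually created for you by the `AddPermission` API (or by the console when you add a trigger) rather than written by hand.

```python
import json


def s3_invoke_statement(function_arn, bucket_arn, account_id):
    """Return a policy statement letting S3 invoke one Lambda function."""
    return {
        "Sid": "AllowS3Invoke",
        "Effect": "Allow",
        # The calling service is the principal, not an IAM user or role.
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        # Restrict the permission to one bucket in one account.
        "Condition": {
            "StringEquals": {"AWS:SourceAccount": account_id},
            "ArnLike": {"AWS:SourceArn": bucket_arn},
        },
    }


# Placeholder values for illustration only.
statement = s3_invoke_statement(
    function_arn="arn:aws:lambda:eu-west-1:123456789012:function:resize-image",
    bucket_arn="arn:aws:s3:::my-upload-bucket",
    account_id="123456789012",
)
print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```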

Lambda Environment Variables

Key-value pairs, stored as strings.

Allows for adjusting function behaviour without updating code.

The environment variables are available to the code.

The lambda service adds its own system environment variables as well.

The environment variables can be encrypted by KMS to store secrets. (encrypted by the Lambda service key or your own CMK)
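
A minimal sketch of reading environment variables in a Python handler follows. `TABLE_NAME` and `MAX_ITEMS` are hypothetical variable names chosen for the example, while `AWS_REGION` is one of the variables the Lambda service sets itself.

```python
import os


def handler(event, context):
    # Environment variables are always strings, so convert explicitly.
    table_name = os.environ.get("TABLE_NAME", "default-table")
    max_items = int(os.environ.get("MAX_ITEMS", "10"))
    # Set by the Lambda service; falls back to "unknown" when run locally.
    region = os.environ.get("AWS_REGION", "unknown")
    return {"table": table_name, "max_items": max_items, "region": region}
```

Changing `TABLE_NAME` in the function configuration changes the behaviour without redeploying the code.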

Lambda monitoring and X-Ray tracing

  • CloudWatch Logs
    • Lambda execution logs are automatically stored in CloudWatch Logs, but the function needs the correct IAM policy to write to CloudWatch Logs. This is included in the lambda basic execution role.
  • CloudWatch Metrics
    • Lambda metrics are displayed in CloudWatch Metrics
    • Invocations, Duration, Concurrent Executions
    • Errors, Success Rate, Throttles
    • Async Delivery failures
    • Iterator age (Kinesis and DynamoDB streams)
  • Tracing with X-Ray
    • Enable in lambda configuration (Active Tracing)
    • Runs the X-Ray daemon for you
    • Use the X-Ray SDK in the code
    • Ensure the lambda function has the correct IAM execution role (AWSXRayDaemonWriteAccess)
    • Environment variables to communicate with X-Ray
      • _X_AMZN_TRACE_ID – contains the tracing header
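
The tracing header is a semicolon-separated string of `key=value` fields (for example `Root=...;Parent=...;Sampled=1`). Below is a small sketch of reading and parsing it without the X-Ray SDK. The helper `parse_trace_header` is made up for this example, and it assumes the runtime exposes the header as an environment variable, as the list above describes.

```python
import os


def parse_trace_header(header):
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed parts
            fields[key] = value
    return fields


def handler(event, context):
    header = os.environ.get("_X_AMZN_TRACE_ID", "")
    trace = parse_trace_header(header)
    return {
        "trace_id": trace.get("Root"),
        "sampled": trace.get("Sampled") == "1",
    }
```

Logging the trace ID this way can help correlate CloudWatch log lines with a trace in the X-Ray console.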

Lambda in VPC

By default a lambda is launched outside your own VPC.

Therefore it cannot access resources within your VPC (e.g. RDS, ElastiCache).

But Lambdas can be deployed within your VPC.

The VPC ID, Subnets and Security Groups have to be defined.

Behind the scenes the lambda will create an ENI (Elastic Network Interface) in the subnets.

To create the ENI, the lambda needs the AWSLambdaVPCAccessExecutionRole.

Lambda in VPC – Internet Access

By default the Lambda function inside the VPC does not have access to the internet.

Deploying a Lambda Function in a public subnet does not give it internet access or a public IP.

Deploying a lambda function inside a private subnet gives it internet access only if the subnet routes through a NAT Gateway / Instance.

If the lambda needs access to another AWS service, it can reach it either through the public route (via the NAT) or using VPC endpoints.

VPC Endpoints allow access to AWS services privately within your cloud without a NAT.

CloudWatch Logs work even without an endpoint or NAT gateway.

Lambda Function Performance

  • RAM
    • From 128 MB to 10,240 MB (10 GB) in 1 MB increments
    • The more RAM set, the more vCPU credits assigned (can’t manually set the CPU)
    • At 1792MB a function has the equivalent of one full vCPU.
    • After 1792MB you get more than one CPU so you need to use multithreading in your code to benefit from it.
    • So if your application is computationally intensive, increase RAM
  • Timeout
    • Default 3 seconds (if it runs for more than 3 seconds it will timeout)
    • Max 900 seconds (15 minutes)
  • Lambda Execution Context
    • A temporary runtime environment that initialises any external dependencies of the lambda code.
    • Useful for database connections, HTTP clients, SDK clients
    • It is maintained for some time in anticipation of another lambda function invocation
    • The next lambda invocation can reuse the context, which improves performance because objects such as connections are already initialized
    • Execution context includes the /tmp directory (write files which will be available across executions)

Always initialize things that take time to set up (e.g. DB connections) outside the handler so that they are shared across invocations.
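
The pattern can be sketched as follows. An in-memory `sqlite3` database stands in for a real database connection purely so the example runs anywhere; the table name and return fields are invented for illustration.

```python
import sqlite3
import time

# Module-level code runs once per execution environment (on cold start),
# not once per invocation. sqlite3 is only a stand-in for a real database.
CONNECTION = sqlite3.connect(":memory:")
CONNECTION.execute("CREATE TABLE IF NOT EXISTS hits (ts REAL)")
INITIALISED_AT = time.time()


def handler(event, context):
    # The handler reuses the connection created above on every warm invocation.
    CONNECTION.execute("INSERT INTO hits (ts) VALUES (?)", (time.time(),))
    CONNECTION.commit()
    (count,) = CONNECTION.execute("SELECT COUNT(*) FROM hits").fetchone()
    return {
        "invocations_in_this_environment": count,
        "initialised_at": INITIALISED_AT,
    }
```

Calling `handler` twice in the same process returns the same `initialised_at` value with an increasing count, which mirrors what happens when Lambda reuses a warm execution context.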

  • /tmp space
    • Useful if the lambda function needs to download a big file or needs more disk space
    • use /tmp directory
    • 512 MB by default (ephemeral storage can be configured up to 10,240 MB).
    • The directory content remains when the execution is frozen providing a transient cache that can be used for multiple invocations. (helpful to checkpoint work).
    • If permanent persistence is required, use S3.
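
A minimal sketch of using the temporary directory as a transient cache is shown below. The helper `get_cached` and the `lambda-cache` folder name are hypothetical. `tempfile.gettempdir()` is used so the example also runs outside Lambda; inside Lambda it would normally resolve to `/tmp`.

```python
import os
import tempfile

# Normally /tmp on Lambda; tempfile keeps the sketch runnable elsewhere.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "lambda-cache")


def get_cached(name, fetch):
    """Return (data, was_cached). Calls fetch() only when the file is missing."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):
        # Warm invocation in the same execution context: reuse the file.
        with open(path, "rb") as cached_file:
            return cached_file.read(), True
    data = fetch()  # e.g. a download from S3 in a real function
    with open(path, "wb") as cached_file:
        cached_file.write(data)
    return data, False
```

Because the cache disappears whenever the execution context is recycled, the code must always be able to fall back to `fetch()`.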

Lambda Concurrency and Throttling

Concurrency limit – 1000 concurrent executions.

A reserved concurrency can be set at the function level to limit the number of concurrent executions.

Each invocation over the concurrency limit will trigger a throttle.

  • Throttle behaviour
    • Synchronous invocation – returns a throttle error (HTTP 429, TooManyRequestsException)
    • Asynchronous invocation – retried automatically and then sent to the DLQ

If you need more than 1000 concurrent executions, open a support ticket.

  • Lambda Concurrency Issues – if you don’t reserve concurrency (limit)
    • Concurrency limit applies to all lambdas in your account.
    • So if the limit is hit, a throttle error will be thrown and other services may be affected.
  • Concurrency and Asynchronous Invocations
    • If the function's reserved concurrency has been reached, additional requests are throttled
    • For throttling errors (429) and system errors (500), lambda returns the event to the queue and attempts to run the function again for up to 6 hours.
    • The retry interval increases exponentially from 1 second after the first attempt to a maximum of 5 minutes.
  • Cold starts and provisioned concurrency
    • On a cold start for a new instance, the code is loaded and the code outside the handler runs (init)
    • If the init is large (code, dependencies, SDK) this process can take some time.
    • First request has a higher latency than the rest.
  • Provisioned concurrency
    • Concurrency is allocated before the function is invoked (in advance)
    • So there are no more cold starts and all the invocations have low latency.
    • Application Auto Scaling can manage concurrency (schedule or target utilization)
    • Cold starts in VPC have been reduced.
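
The retry behaviour for throttled asynchronous invocations described above (interval growing from 1 second to a maximum of 5 minutes, for up to 6 hours) can be sketched as a simple backoff schedule. The doubling factor is an assumption made for illustration, so treat the exact delays as indicative rather than as Lambda's real schedule.

```python
def retry_delays(max_age_seconds=6 * 60 * 60, first_delay=1, max_delay=300, factor=2):
    """List the waits between retries until the event would be too old."""
    delays = []
    elapsed = 0
    delay = first_delay
    while elapsed + delay <= max_age_seconds:
        delays.append(delay)
        elapsed += delay
        # Grow the wait, but never beyond the 5 minute cap.
        delay = min(delay * factor, max_delay)
    return delays


schedule = retry_delays()
print(schedule[:10])  # the first ten waits, in seconds
print(sum(schedule) <= 6 * 60 * 60)  # total waiting stays within 6 hours
```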

Lambda External Dependencies

If the lambda function depends on external libraries the packages need to be installed alongside your code and zipped together.

The zip can be uploaded to lambda directly if it is less than 50 MB.

If the zip is more than 50 MB, upload it to S3 first and reference it from there.

Native libraries need to be compiled on Amazon Linux beforehand and the AWS SDK comes by default with every function.
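
As an illustration of the packaging rule above, the sketch below zips a function in memory and picks an upload route based on the 50 MB threshold. The helper names and the `"direct"` / `"s3"` labels are invented for this example and are not part of any AWS API.

```python
import io
import zipfile

DIRECT_UPLOAD_LIMIT = 50 * 1024 * 1024  # 50 MB zipped, per the limit above


def build_package(files):
    """Zip {archive_name: bytes} in memory and return the archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def upload_strategy(package_size_bytes):
    """Small packages go straight to Lambda, larger ones via S3."""
    return "direct" if package_size_bytes < DIRECT_UPLOAD_LIMIT else "s3"


package = build_package(
    {"lambda_function.py": b"def handler(event, context):\n    return 'ok'\n"}
)
print(len(package), upload_strategy(len(package)))
```

In a real build, the dependency folders installed alongside the code would be added to the same archive.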

Lambda and CloudFormation

  1. Inline
    • Define the lambda function inline within the CloudFormation template
    • For simple functions
    • Use the Code.ZipFile property
    • But can’t include function dependencies.
  2. S3
    • Store the lambda function zip in S3
    • Refer to the S3 zip in the lambda function's Code property
      • S3Bucket attribute
      • S3Key (full path to zip)
      • S3ObjectVersion (recommended)
    • If you update the code in S3 but don’t update the S3Bucket, S3Key or S3ObjectVersion then CloudFormation won’t update your function.
  3. S3 Through Multiple Accounts
    • For the use case where the S3 bucket with the zip is in a different account to where the CloudFormation template is deployed.
    • The S3 bucket needs a bucket policy that allows the other account to access the code.
    • And use an Execution role in the CloudFormation account to allow the account to get the data from S3 in another account.
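
The S3 option can be sketched as a CloudFormation template built as a Python dictionary and dumped to JSON (CloudFormation accepts JSON as well as YAML). The logical ID `MyFunction`, the handler, runtime, bucket, key, version and role ARN are all placeholder values chosen for this example.

```python
import json


def lambda_from_s3_template(bucket, key, object_version, role_arn):
    """Return a minimal template that deploys a function from a zip in S3."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "MyFunction": {  # hypothetical logical ID
                "Type": "AWS::Lambda::Function",
                "Properties": {
                    "Handler": "lambda_function.handler",
                    "Runtime": "python3.12",
                    "Role": role_arn,
                    "Code": {
                        "S3Bucket": bucket,
                        "S3Key": key,
                        # Changing this value is what makes CloudFormation
                        # pick up new code uploaded under the same key.
                        "S3ObjectVersion": object_version,
                    },
                },
            }
        },
    }


template = lambda_from_s3_template(
    bucket="my-code-bucket",
    key="functions/my-function.zip",
    object_version="example-version-id",
    role_arn="arn:aws:iam::123456789012:role/my-lambda-role",
)
print(json.dumps(template, indent=2))
```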

Lambda Layers

  • Use cases
    1. Allows you to create custom runtimes for lambda. e.g. C++ and Rust
    2. Externalize dependencies to reuse them (heavy dependencies can be externalized into a layer if they don’t change frequently)

Lambda Container Images

Deploy Lambda functions as container images of up to 10GB from ECR.

Allows you to pack complex dependencies and large dependencies into a container.

When containerising application code, the base image must implement the Lambda Runtime API.

Base images are available for most major languages such as Java, Python etc.

But you can create your own image as long as it implements the Lambda Runtime API.

This also gives you the ability to test the containers locally using the Lambda Runtime Interface Emulator.

Lambda Versions and Aliases

When working on a Lambda function you work on $LATEST. (This is mutable)

When ready to publish a lambda function, you create a version.

Versions are immutable and have their own ARN.

Each version of the lambda function can be accessed.

But this isn’t ideal since the ARN will be frequently changing.

  • So to give users a stable endpoint, use aliases
    • Pointers to Lambda function versions
    • Aliases are mutable.
    • Enables Blue/Green deployment by assigning weights to lambda function versions
    • Enables stable configuration of event triggers / destinations.
    • Aliases get their own ARNs
    • Aliases cannot reference aliases.
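
To show what assigning weights means, here is a toy simulation of how an alias could split traffic between two versions. The function `pick_version` is invented for this sketch and only imitates the routing locally. On AWS the weights are set on the alias itself through its routing configuration (`AdditionalVersionWeights`).

```python
import random


def pick_version(primary_version, additional_weights, rng=random):
    """Choose a version the way a weighted alias would split traffic."""
    roll = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for version, weight in additional_weights.items():
        cumulative += weight
        if roll < cumulative:
            return version
    # Everything not claimed by an additional version goes to the primary.
    return primary_version


# Hypothetical alias "prod": version 1 is primary, version 2 gets about 10%.
rng = random.Random(42)
picks = [pick_version("1", {"2": 0.1}, rng) for _ in range(10000)]
print(picks.count("2") / len(picks))

# With boto3, the equivalent alias update could look like:
#   boto3.client("lambda").update_alias(
#       FunctionName="my-function", Name="prod", FunctionVersion="1",
#       RoutingConfig={"AdditionalVersionWeights": {"2": 0.1}},
#   )
```

Callers keep using the alias ARN throughout, so the shift from version 1 to version 2 needs no change on their side.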

Lambda and CodeDeploy

CodeDeploy can help to automate traffic shift for Lambda aliases.

This feature is integrated with the SAM framework.

CodeDeploy can shift the traffic distribution over time between the versions that an alias points to.


  • Linear Option
    • Grow traffic every N minutes until 100%
    • e.g. Linear10PercentEvery3Minutes, Linear10PercentEvery10Minutes
  • Canary
    • Try X % then switch to 100%
    • e.g. Canary10Percent5Minutes
  • All At Once
    • Immediate
  • Rollbacks
    • Can create Pre and Post traffic hooks to check the health of the Lambda function
    • CloudWatch alarms can be set up to detect failing lambdas and trigger a rollback.
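
The linear and canary options can be illustrated by computing how much traffic the new version receives over time. This is a sketch only: the function names are made up, and it assumes the first increment is applied at minute 0.

```python
def linear_schedule(percent_step, interval_minutes):
    """Return (minute, percent_on_new_version) pairs for a linear shift."""
    schedule = []
    percent = percent_step
    minute = 0
    while percent < 100:
        schedule.append((minute, percent))
        percent += percent_step
        minute += interval_minutes
    schedule.append((minute, 100))
    return schedule


def canary_schedule(canary_percent, wait_minutes):
    """Send a small share first, then everything after the wait."""
    return [(0, canary_percent), (wait_minutes, 100)]


# Mirrors Linear10PercentEvery3Minutes and Canary10Percent5Minutes.
print(linear_schedule(10, 3))
print(canary_schedule(10, 5))
```

Under these assumptions the linear option reaches 100% after 27 minutes, while the canary option reaches it after 5.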

Lambda Limits

Limits are per region.

  • Execution
    • Memory Allocation: 128 MB – 10 GB (1 MB increments)
    • Maximum execution time (900 seconds, 15 minutes)
    • Environment variables (4KB)
    • /tmp space disk capacity in the “function container” – 512 MB by default
    • 1000 Concurrent executions
  • Deployment
    • Lambda function deployment size (compressed zip) – 50 MB
    • Size of uncompressed deployment (code + dependencies) – 250 MB
    • Use /tmp directory to load other files at startup
    • Environment variables (4KB)

Lambda Best Practices

  • Perform heavy duty work outside the function handler
    • Connect to databases outside of function handler
    • Initialise SDK outside the function handler
    • Pull in dependencies outside of the function handler.
  • Use environment variables
    • For anything that is going to be changing over time. e.g. DB connection strings
    • For sensitive values, they can be encrypted using KMS
  • Minimize deployment package size to its runtime necessities
    • Break down function if it is too big
    • Remember the lambda limits
    • Use lambda layers to reuse libraries
  • Avoid recursive code

