Rotate EC2 Instances of an Autoscaling Group with Terraform

Deploying infrastructure on AWS with terraform is usually a straightforward job. Most resources of the aws provider are well integrated and match the AWS API closely. For some use cases, however, it becomes clear that terraform is third-party software and not endorsed by Amazon itself.

One such case is an autoscaling group whose instances we want to rotate after applying a new launch template.

Overview of the problem

For our purposes, an Autoscaling Group (ASG) in AWS consists of three important parts:

  • The autoscaling group itself
  • The EC2 instances spawned by the autoscaling group
  • A launch template: The description for the ASG to spawn new EC2-instances

[Figure: Autoscaling group]

Whenever you run terraform apply with an updated launch template, terraform will only create a new version of the launch template and will update the ASG to use the new launch template for future instances.

[Figure: Autoscaling group future]

It won’t change or replace the existing EC2 instances. The instances that are already running were started from the old launch template and simply keep running!

How is it possible to make terraform terminate the “old” instances?

Status Quo to rotate an ASG

  • AWS CloudFormation has solved this problem already and does a proper rollover.
  • The non-CloudFormation way: Scale the ASG to double its capacity and back down again

Approach No. 1

We could try the common approach: scale the ASG up by a factor of two and scale it down again afterwards.

There is a nice native feature of AWS called Autoscaling Schedule, which is exactly what we need.

So let’s create an ASG in terraform:

resource "aws_autoscaling_group" "asg" {
  name             = "Autoscaling Group Example"
  min_size         = 10
  desired_capacity = 15
  max_size         = 20

  ...
  ...

  # Remove the instances with the oldest launch templates first
  termination_policies = ["OldestLaunchTemplate", "OldestLaunchConfiguration"]

}

And add a scheduled action to it, which scales the ASG up for a short time (here: 10 minutes). Your instances should start in less time than that. Afterwards, when the ASG scales back to normal, the old instances get removed first because of the termination policies.

resource "aws_autoscaling_schedule" "asg_rotate_instances" {
  scheduled_action_name  = "Rotate instances and remove old ones"
  min_size               = 2 * aws_autoscaling_group.asg.min_size
  desired_capacity       = 2 * aws_autoscaling_group.asg.desired_capacity
  max_size               = 2 * aws_autoscaling_group.asg.max_size
  start_time             = timestamp()
  end_time               = timeadd(timestamp(), "10m")
  autoscaling_group_name = aws_autoscaling_group.asg.name
}

This is implemented in pure terraform, but we now hit a corner case: timestamp() returns a new value on every run, so every terraform apply changes the schedule and replaces the instances, even if there is no need.

This is unfortunate, but there is no better native terraform way. I had the idea to look up the old instances via the aws_instances data source, count them, and adjust the autoscaling schedule to scale only to desired_capacity + old_ones. But in the most important case (count = 0, no old instances), the aws_instances data source fails, because it doesn’t match anything.

Approach No. 2

So we can’t run native terraform. Can we misuse some concepts?

  • We’ve got the exact data in the terraform state, so we could automate some calls with the AWS CLI. 🤔
  • There is a local_file resource, which allows us to write arbitrary local files. 🤔
  • We may be able to misuse the local-exec provisioner… 🤯

What about letting terraform write a shell script to a file, filled with the values from its state, which queries the ASG instances via the AWS CLI? And then executing this script?

Write a script

I’ll explain the important bits of this script right now. If you want to skip it, you can jump right to the solution.

How to query the instances?

First we have to query the data about our instances, so that we can filter them in more detail afterwards.

aws ec2 describe-instances \
  --filters \
    "Name=tag:aws:autoscaling:groupName,Values=<ASG_NAME>" \
    "Name=tag:aws:ec2launchtemplate:id,Values=<LAUNCH_TEMPLATE_ID>" \
    "Name=instance-state-name,Values=pending,running,shutting-down,stopping,stopped"

  • With the tag:aws:autoscaling:groupName filter, we only list instances that belong to our ASG.
  • With the tag:aws:ec2launchtemplate:id filter, we drop the instances that weren’t spawned from our specific template. ASGs aren’t limited to a single template.
  • The third filter discards the instances that are already terminated. This is important, since terminated instances are still linked to their ASG, but need no further processing and will finally get removed later.

But this aws cli command still lists ALL instances of the ASG that were launched from our template, whatever their template version. Unfortunately the cli filters can’t express a negative match, so we have to pipe the output through jq:

  | jq -r ".Reservations[].Instances[] | select(contains({Tags: [{Key: \"aws:ec2launchtemplate:version\", Value: \"<LATEST LAUNCH TEMPLATE VERSION>\"}]}) | not) | .InstanceId" \

This selects all instances that were not launched from the latest template version and prints their instance IDs.

So we’ve got a list of instances like the following, which we’re able to feed into a while/for loop later.

i-1234567890abcdef0
i-1234567890abcdef1
i-1234567890abcdef2
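To try the negative match without touching AWS, the filter can be run against a small hand-written JSON document. The following is only a sketch: the function name old_instance_ids and the sample data are made up for illustration. It also swaps the contains() expression for jq's any() with an exact string comparison and passes the version via --arg, which avoids both substring matching and escaped quotes. It assumes jq 1.5 or newer.

```shell
#!/bin/sh
# Print the IDs of all instances in a describe-instances JSON document (stdin)
# that do NOT carry the given launch template version tag.
# Hypothetical helper. Usage: old_instance_ids <latest-version> < describe-instances.json
old_instance_ids() {
  jq -r --arg ver "$1" '
    .Reservations[].Instances[]
    | select(any(.Tags[]?; .Key == "aws:ec2launchtemplate:version" and .Value == $ver) | not)
    | .InstanceId'
}

# Made-up sample data in the shape returned by "aws ec2 describe-instances"
sample_json='{"Reservations":[{"Instances":[
  {"InstanceId":"i-1234567890abcdef0","Tags":[{"Key":"aws:ec2launchtemplate:version","Value":"3"}]},
  {"InstanceId":"i-1234567890abcdef1","Tags":[{"Key":"aws:ec2launchtemplate:version","Value":"4"}]}
]}]}'

# With version 4 as the latest one, only the instance tagged with version 3 is listed
printf '%s\n' "$sample_json" | old_instance_ids 4
```

The last line prints i-1234567890abcdef0, the only instance in the sample that was not launched from version 4.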

Finally, to process a single instance, we remove it from the ASG with terminate-instance-in-auto-scaling-group. The --no-should-decrement-desired-capacity flag tells the ASG to keep its desired capacity, so it spawns a replacement:

aws autoscaling terminate-instance-in-auto-scaling-group \
  --no-should-decrement-desired-capacity \
  --instance-id "<instance ID>"

The full example

So how does it look in general?


# A simple autoscaling group ...
resource "aws_autoscaling_group" "asg" {
  name             = "Autoscaling Group Example"
  min_size         = 10
  desired_capacity = 15
  max_size         = 20

  launch_template {
    id      = aws_launch_template.lt.id
    version = "$Latest"
  }
  ...
  ...
}

# ... with its launch template
resource "aws_launch_template" "lt" {
}

# Write a script to a file, which can drop the old instances from the ASG
resource "local_file" "autoscaling_instances_drop_old" {
  filename = "${path.module}/drop_old_instances.sh"
  content  = <<-EOF
    #!/bin/sh

    set -ex

    # Terminate all instances, which are in the ASG, but are not launched with the latest template.
    # You can always execute it. If there are no old instances left, it does nothing.

    # FIXME: Adapt these terraform parameters when using this example.
    ASG="${aws_autoscaling_group.asg.id}"
    ASG_COOL="${aws_autoscaling_group.asg.default_cooldown}"
    LT_ID="${aws_launch_template.lt.id}"
    LT_VER="${aws_launch_template.lt.latest_version}"

    # List all instances, which are part of the ASG, but aren't started with the current latest_version of the launch template
    aws ec2 describe-instances \
      --filters \
        "Name=tag:aws:autoscaling:groupName,Values=$ASG" \
        "Name=tag:aws:ec2launchtemplate:id,Values=$LT_ID" \
        "Name=instance-state-name,Values=pending,running,shutting-down,stopping,stopped" \
      | jq -r ".Reservations[].Instances[] | select(contains({Tags: [{Key: \"aws:ec2launchtemplate:version\", Value: \"$LT_VER\"}]}) | not) | .InstanceId" \
      | while read -r instance_id; do

        # Terminate the old instance and let the ASG replace it afterwards
        aws autoscaling terminate-instance-in-auto-scaling-group \
          --no-should-decrement-desired-capacity \
          --instance-id "$instance_id"

        # Sleep as long as the ASG has a scaling cooldown, so we don't remove instances faster than it replaces them
        sleep "$ASG_COOL"
      done
  EOF
}

So after your terraform apply, just run ./drop_old_instances.sh too. It terminates instances only if old ones are left.
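The script paces itself with a fixed sleep "$ASG_COOL". A possible alternative is to poll the ASG until enough instances are in service again before terminating the next one. The sketch below is an assumption and not part of the original script: the helper names in_service_count and wait_for_capacity are made up, and it relies on the LifecycleState field that aws autoscaling describe-auto-scaling-groups reports for each instance.

```shell
#!/bin/sh
# Count the instances of ASG $1 that are currently in the "InService" state.
# Hypothetical helper around the real describe-auto-scaling-groups call.
in_service_count() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])" \
    --output text
}

# Block until at least $2 instances of ASG $1 are in service.
# Polls every $3 seconds and gives up after $4 attempts (exit status 1).
wait_for_capacity() {
  asg="$1"; wanted="$2"; interval="$3"; attempts="$4"
  while [ "$attempts" -gt 0 ]; do
    current="$(in_service_count "$asg")"
    if [ "$current" -ge "$wanted" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "timed out waiting for $wanted in-service instances in $asg" >&2
  return 1
}
```

Inside the loop, a call such as wait_for_capacity "$ASG" 15 15 40 could then take the place of the sleep, with 15 standing in for the desired capacity. The functions deliberately avoid the ${...} form, because terraform would try to interpolate it inside the local_file heredoc (it would have to be written as $${...} there). Since the in-service count may lag for a moment right after a termination call, a short pause before the first poll is probably still sensible.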

Ideas

  • You could combine approach 1 and 2 and write a script, which doesn’t kill the old instances one by one, but creates a schedule if necessary.
  • For approach 1, you could also introduce a variable and add the line count = var.replace_asg ? 1 : 0 to your aws_autoscaling_schedule, so you’d only replace the instances semi-automatically.
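The first idea could look roughly like the sketch below. It is untested against a real ASG and rests on several assumptions: the function names and the action name rotate-old-instances are made up, all three sizes are raised by the number of old instances (instead of doubled as in approach 1), and the caller supplies start and end time in the same UTC format that the aws_autoscaling_schedule resource expects. The number of old instances would come from counting the output lines of the describe-instances and jq pipeline, which, unlike the aws_instances data source, copes fine with zero matches.

```shell
#!/bin/sh
# Print the temporary sizes "min desired max", each raised by the number of
# old instances. Arguments: <min_size> <desired_capacity> <max_size> <old_count>
# Prints nothing and returns 1 when there are no old instances.
rotation_sizes() {
  [ "$4" -gt 0 ] || return 1
  echo "$(($1 + $4)) $(($2 + $4)) $(($3 + $4))"
}

# Create a scheduled action for the rotation, but only if it is needed.
# Arguments: <asg> <start_time> <end_time> <min> <desired> <max> <old_count>
schedule_rotation() {
  asg="$1"; start="$2"; end="$3"
  # Nothing to rotate: leave the ASG untouched
  sizes="$(rotation_sizes "$4" "$5" "$6" "$7")" || return 0
  # Split "min desired max" into the positional parameters $1 $2 $3
  set -- $sizes
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name "$asg" \
    --scheduled-action-name "rotate-old-instances" \
    --start-time "$start" \
    --end-time "$end" \
    --min-size "$1" \
    --desired-capacity "$2" \
    --max-size "$3"
}
```

With the example ASG (10/15/20) and three old instances, rotation_sizes 10 15 20 3 prints 13 18 23. With zero old instances, schedule_rotation returns without calling the AWS CLI at all, which is exactly the case the pure terraform variant could not handle.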