Automating AWS EC2 Spot Instance Backup & Deployment

Introduction

Using on demand AWS EC2 servers can quickly become expensive, especially if you’re paying for them personally. As projects grow more complex, the infrastructure requirements become more demanding and sophisticated, and the cost rises accordingly.

However, AWS also offers spot instance EC2 servers. Spot instances effectively allow you to bid on unused EC2 capacity rather than paying the full on demand price. The advantage is that the cost is significantly lower; the risk is that you can be outbid at any time and instantly lose all of the scripts and data on the server.

I decided to transition to using a spot instance with some automated features such as backups. I wanted to write an article detailing the process required to implement this for anyone else who is finding that on demand EC2 servers are becoming too costly.

This article will cover the following:

  • Setting up an IAM role for running the automated procedures
  • Setting up JupyterHub as a service so it’s already running on the new server via bash on Ubuntu 18.04
  • Scheduling an hourly backup of your EC2 spot instance to AMI via Lambda
  • Automatically starting a new spot instance server at 8am via Lambda
  • Creating a bot on Slack to share the new spot instance JupyterHub web address to your team
  • Automatically shutting down the spot instance server at 2am via Lambda

Note:

This article assumes a fairly high level of technical understanding of AWS, Linux & Python; however, I will try to be as detailed as possible.

This article also assumes you already have an EC2 server set up with JupyterHub installed. If this isn’t the case, please go through the following article first – Setting Up JupyterHub On AWS EC2

This article includes a section on setting up a Slack bot to automatically send out the new web address for your spot instance server. This section can be skipped if you don’t use Slack.

Preliminary Checks

You should currently have an AWS EC2 server set up using on demand pricing. The server should have JupyterHub installed and be accessible using the EC2 IPV4 address followed by the port number, e.g. https://54.153.12.45:8888

You should also ensure you have the correct security groups set up and the .PPK file used to SSH into the server.

If you’re missing any of these please use the above article to set up JupyterHub on AWS EC2 first.

Setting Up An IAM Role

To automate a lot of the procedures in this article I use Lambda functions: an AWS service that allows you to run small snippets of code in response to certain “triggers”. The scripts in the Lambda functions will require access tokens and the correct permissions to access the AWS environment. These can be set up through IAM on AWS.

Create a new user in IAM. I would suggest a name along the lines of “Boto3”, then select “Programmatic access” as can be seen below. This will generate the necessary access tokens.

You will then need to add the appropriate permission policies to allow the user to access the services we will be using. Depending on your wider infrastructure you may need to tweak this; adding the “AdministratorAccess” policy ensures there are no permission issues, but bear in mind that granting full admin access carries security risks, so use a more restrictive policy if you can.

At this point you should make a note of the Access Key ID, Secret Access Key & user ARN as seen below, and keep these safe.

Setting Up JupyterHub As A Service

Before we can start working on automating spot instances, we first need to set up JupyterHub as a service on our current on demand server. The reason for this is that when the new spot instance server is created from the AMI backup, it will have JupyterHub installed, but it would need to be started manually via SSH by running
jupyter-notebook

To avoid this manual step we will set up JupyterHub as a service. This means it will be running on the Ubuntu server as soon as the instance is created or restarted, and will be accessible via the IPV4 address and the port number.

In Ubuntu 18.04, services are managed by systemd. First SSH into the current on demand EC2 server, then create the Jupyter service file using
sudo vim /etc/systemd/system/jupyter.service

You’ll now be in the text editor for the Jupyter service. Copy and paste the code below; if you followed the previous article you won’t need to make any changes.

[Unit]
Description=Automated Jupyter Service

[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/home/ubuntu/anaconda3/bin/jupyter-notebook --config=/home/ubuntu/.jupyter/jupyter_notebook_config.py
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu/notebooks
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target

Save your changes by pressing Esc and then typing :wq followed by Enter.

Then run the following commands to register the service with systemd and start it.

sudo systemctl daemon-reload
sudo systemctl enable jupyter.service
sudo systemctl restart jupyter.service
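
You can confirm the service came up correctly (it should report “active (running)”) by running

sudo systemctl status jupyter.service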

At this point Jupyter has been set up as a service and will always be running. You can test this by restarting your on demand EC2 instance via the console and accessing Jupyter via the standard IPV4 address and port number.

Scheduling An Automated Hourly AMI Backup

Now that we’ve got an on demand server running with Jupyter as a service and the necessary IAM policies, it’s time to schedule the automated hourly AMI backup via AWS Lambda. For reference, an AMI is an exact copy of your machine, stored for later restoration: it captures the operating system, the files & the packages, which is ideal for our use case.

For the first time, we will need to manually create the AMI by logging in to the EC2 console, selecting Actions > Image > Create Image

This can be seen below. Please make a note of the image name as this will be needed later.

Now we need to navigate to AWS Lambda and create a new function. For the function’s execution role you can let Lambda create a new one or use an existing role; the scripts themselves authenticate with the access keys of the IAM user we created earlier. For the purpose of this guide, choose Python 3.7 in the “runtime” option as can be seen below.

Copy the code from the GitHub script below into the IDE on Lambda, ensuring you replace the SecretAccessKey, AccessKeyID, RegionName & AMIName with the values noted from the previous steps.

https://github.com/DataInspector/AutomateAWSEC2/blob/master/AutoBackupEC2.py 

This script will connect to your AWS infrastructure and find any running instances in the specified region. It will then overwrite the current AMI with a new one taken from the current state of the EC2 server.

Note: The script is designed to work where there is only one server in the region, and it requires a pre-existing AMI (hence the manual step earlier).
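
For reference, a minimal sketch of what such a backup function looks like is shown below. I’ve written it from the description above, so the variable names and placeholder values are illustrative; treat the GitHub script as the authoritative version.

import boto3

# Illustrative placeholder values – replace with the ones you noted earlier
REGION_NAME = "eu-west-2"
AMI_NAME = "JupyterBackup"
ACCESS_KEY_ID = "AccessKeyID"
SECRET_ACCESS_KEY = "SecretAccessKey"

def lambda_handler(event, context):
    ec2 = boto3.client(
        "ec2",
        region_name=REGION_NAME,
        aws_access_key_id=ACCESS_KEY_ID,
        aws_secret_access_key=SECRET_ACCESS_KEY,
    )

    # Find the running instance in the region
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    if not reservations:
        # No running server (e.g. we were outbid) – skip until the next hour
        return "No running instance found, skipping backup"
    instance_id = reservations[0]["Instances"][0]["InstanceId"]

    # Deregister the previous AMI so the name can be reused
    old_images = ec2.describe_images(
        Filters=[{"Name": "name", "Values": [AMI_NAME]}]
    )["Images"]
    for image in old_images:
        ec2.deregister_image(ImageId=image["ImageId"])

    # Take a fresh image of the running server without rebooting it
    ec2.create_image(InstanceId=instance_id, Name=AMI_NAME, NoReboot=True)
    return "Backup complete"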

We now need to set up a CloudWatch event to run the script automatically every hour. We can do this by adding the trigger from the left menu to the Lambda function, as seen in the image below.

Note: I used a schedule pattern of rate(1 hour), however this can be tweaked based on your use case.

The Lambda function is now configured to run on an hourly basis. It will check whether any servers are running in the specified region and overwrite the current AMI with a snapshot of the server. If it can’t find a running server (for example, if you were outbid), it will skip the procedure until the next hour so the existing AMI isn’t overwritten.

We can now shut down our on demand server, as we have the AMI stored with the Jupyter service configured, which is what we needed.

Automatically Starting A Spot Instance At 8am

We will now set up an additional Lambda function to automatically start a new spot instance at 8am. The server will already have Jupyter running, and we can access it using the IPV4 address from the AWS console with the port number :8888 attached at the end.

However, in the interest of full automation, I wanted to automatically share the new Jupyter web address with my team, rather than have them individually log in to AWS to get the IPV4 address each morning. I ended up building a small Slack bot which automatically shares the new server address with the team once a new spot instance is created.

If you would like to set up a spot instance without the Slack feature, please follow the instructions in the No Slack Version below.

If you want to implement the Slack bot feature, skip the No Slack Version and use the Slack Version instead.

No Slack Version

Skip this section if you want a Slack notification included

Create a new Lambda function as we did in the previous step.

Copy the code from the GitHub script below into the IDE on Lambda, ensuring you replace the SecretAccessKey, AccessKeyID, RegionName, AMIName, SpotPrice, InstanceType, KeyName, SecurityGroups & ARN with the values noted from the previous steps.

Note: You can get instance types and their current spot price by visiting – https://aws.amazon.com/ec2/spot/pricing/

https://github.com/DataInspector/AutomateAWSEC2/blob/master/AutoSpotEC2.py
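
If you’re curious what’s going on under the hood, the heart of the script is a call to EC2’s request_spot_instances API via boto3. Below is a minimal sketch with illustrative placeholder values; the GitHub script above is the authoritative version.

import boto3

# Illustrative placeholder values – replace with the ones you noted earlier
REGION_NAME = "eu-west-2"
AMI_NAME = "JupyterBackup"
SPOT_PRICE = "0.10"
INSTANCE_TYPE = "t3.large"
KEY_NAME = "my-key-pair"
SECURITY_GROUPS = ["jupyter-security-group"]

def lambda_handler(event, context):
    ec2 = boto3.client(
        "ec2",
        region_name=REGION_NAME,
        aws_access_key_id="AccessKeyID",
        aws_secret_access_key="SecretAccessKey",
    )

    # Look up the AMI created by the hourly backup function
    image_id = ec2.describe_images(
        Filters=[{"Name": "name", "Values": [AMI_NAME]}]
    )["Images"][0]["ImageId"]

    # Request a one-time spot instance built from that image
    response = ec2.request_spot_instances(
        SpotPrice=SPOT_PRICE,
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": image_id,
            "InstanceType": INSTANCE_TYPE,
            "KeyName": KEY_NAME,
            "SecurityGroups": SECURITY_GROUPS,
        },
    )
    return response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]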

Add a CloudWatch event trigger as we did previously, but this time use a cron expression instead of rate(1 hour). For examples of cron expressions see https://docs.aws.amazon.com/lambda/latest/dg/tutorial-scheduled-events-schedule-expressions.html

To run the Lambda function so the server starts at 8am every day, use the following: cron(0 8 ? * * *). Note that CloudWatch cron schedules are evaluated in UTC, so adjust the hour for your timezone if needed.

Since the script can take a bit of time to run we need to increase the timeout duration of the Lambda function by scrolling down to the section seen below.

A spot instance server will now be started every morning at 8am with a working and running instance of JupyterHub!

Slack Version

Skip this section if you don’t want an automated Slack bot and have already followed the section above

The first step is to create a Slack bot with the correct permissions to post in your workspace. Please note down the access token when doing this.

For a detailed guide on setting this up please refer to the link below.

Note: you only need to get to the point where you’ve generated an access token; don’t follow the whole article – https://www.fullstackpython.com/blog/build-first-slack-bot-python.html

Now, create a new Lambda function as we did in the previous step.

Unfortunately, using a Slack bot from Python requires the “slackclient” Python package to be installed. Making this package available within Lambda involves some additional steps, which I will outline below.

First, copy the code from the GitHub script below into the IDE on Lambda, ensuring you replace the SecretAccessKey, AccessKeyID, RegionName, AMIName, SpotPrice, InstanceType, SlackToken, SlackChannel, KeyName, SecurityGroups & ARN with the values noted from the previous steps.

Note: You can get instance types and their current spot price by visiting – https://aws.amazon.com/ec2/spot/pricing/

Note: the SlackChannel is simply the name of the channel within Slack that the bot will post the new server IP to.

https://github.com/DataInspector/AutomateAWSEC2/blob/master/AutoSpotEC2WithSlack.py 
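
The Slack-specific part of the script essentially waits for the new instance to come up, reads its public IP, and posts the address to the channel. Below is a rough sketch of that portion; the function name and placeholder values are illustrative, and the GitHub script above is the authoritative version.

import boto3
from slack import WebClient  # slackclient 2.0 exposes the client as slack.WebClient

SLACK_TOKEN = "xoxb-your-token"  # illustrative – use your bot's access token
SLACK_CHANNEL = "#data-science"  # illustrative – your channel name

def share_new_address(instance_id, region_name="eu-west-2"):
    ec2 = boto3.client("ec2", region_name=region_name)

    # Wait until the new spot instance is running, then fetch its public IP
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    description = ec2.describe_instances(InstanceIds=[instance_id])
    ip = description["Reservations"][0]["Instances"][0]["PublicIpAddress"]

    # Post the new JupyterHub address to the team channel
    WebClient(token=SLACK_TOKEN).chat_postMessage(
        channel=SLACK_CHANNEL,
        text=f"Good morning! JupyterHub is running at https://{ip}:8888",
    )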

Now export the Lambda function as a ZIP by clicking Actions > Export Function > Download Deployment Package

On your local machine, unzip the downloaded package.

In a terminal on your local machine, run the following command, changing “/path/to/unzip” to the path where you unzipped the deployment package from Lambda.

pip install slackclient==2.0.0 -t /path/to/unzip

This will download the slackclient package and its dependencies to the specified folder. We can now re-zip the folder’s contents and upload them to Lambda by clicking Code Entry Type > Upload a .zip File, as can be seen below.
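
Note: Lambda expects the handler file at the root of the archive, so zip the contents of the folder rather than the folder itself. For example, from inside the unzipped folder run

zip -r ../deployment.zip .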

Add a CloudWatch event trigger as we did previously, but this time use a cron expression instead of rate(1 hour). For examples of cron expressions see https://docs.aws.amazon.com/lambda/latest/dg/tutorial-scheduled-events-schedule-expressions.html

These steps may have felt convoluted, but they were necessary to prepare the Lambda function’s environment so that it can run a Slack client and automatically send out the new server web address.

To run the Lambda function so the server starts at 8am every day, use the following: cron(0 8 ? * * *). As before, the schedule is evaluated in UTC.

Since the script can take a bit of time to run we need to increase the timeout duration of the Lambda function by scrolling down to the section seen below.

A spot instance server will now be started every morning at 8am with a working and running instance of JupyterHub. A Slack bot has also been set up that will automatically message your team the new web address for JupyterHub every morning!

Automatically Shutting Down A Spot Instance At 2am

The last step is to set up a Lambda function to automatically shut down the server at 2am to minimise costs. The sections above ensure that an exact copy of the server is recreated at 8am using the AMI backup.

Create a new Lambda function as we did in the previous step.

Copy the code from the GitHub script below into the IDE on Lambda, ensuring you replace the SecretAccessKey, AccessKeyID & RegionName with the values noted from the previous steps.

https://github.com/DataInspector/AutomateAWSEC2/blob/master/AutoTerminateEC2.py
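
For reference, a minimal sketch of the shutdown logic is below, again with illustrative placeholder values; the GitHub script above is the authoritative version.

import boto3

REGION_NAME = "eu-west-2"  # illustrative – use your own region

def lambda_handler(event, context):
    ec2 = boto3.client(
        "ec2",
        region_name=REGION_NAME,
        aws_access_key_id="AccessKeyID",
        aws_secret_access_key="SecretAccessKey",
    )

    # Find and terminate every running instance in the region
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return f"Terminated: {instance_ids}"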

Add a CloudWatch event trigger as we did previously, but this time use a cron expression instead of rate(1 hour). For examples of cron expressions see https://docs.aws.amazon.com/lambda/latest/dg/tutorial-scheduled-events-schedule-expressions.html

To run the Lambda function so the server terminates at 2am every day, use the following: cron(0 2 ? * * *). Again, the schedule is evaluated in UTC.

Conclusion

If you managed to make it this far through the article, congratulations!

Although it took a lot of steps, we now have a system in place which does the following:

  • Backs up the spot instance server every hour
  • Terminates the server at 2am
  • Boots up a replica spot instance server at 8am with JupyterHub running
  • Sends an automated Slack message to your team with the JupyterHub address

For reference the slack message looks like the below:

The bot name, instance type & Jupyter address will reflect your configuration

As a result of implementing this new system, I was able to cut down on our infrastructure costs by approximately 80%!

Although there is a risk of losing progress, we can always revert to the previous hour’s backup, which minimises the loss. Depending on how important minimising the loss is, we can increase the backup frequency, for example by changing rate(1 hour) to rate(15 minutes)

Inspectors Notes

Harib – The main reason I decided to go through this process was that we were beginning to use GPU servers and instances with a large amount of memory (RAM). As a result, the monthly cost was growing rapidly. Since we don’t have important commercial data, I decided that losing a potential maximum of 59 minutes of work was worth the risk. Spot instances are rarely outbid, and historic pricing can be seen on AWS; using this, we can choose a high tolerance bid limit if need be.

This now means that we are able to use more powerful servers whilst keeping costs low. It also means that we don’t have wasted expenditure in the middle of the night when nobody is working.

We have been using this system for a while now and have yet to experience any issues. One current limitation of the approach is that only one server can be running in the specified region, but this can easily be tweaked in the above scripts and wasn’t needed in our use case.

Some of the steps detailed in this guide can be quite tricky. If you have any issues following the steps, let me know and I’ll be more than happy to try and help. Alternatively, the AWS documentation is extremely detailed.

GitHub

All scripts can be found at – https://github.com/DataInspector/AutomateAWSEC2
