Train a Deep Learning Model with AWS Deep Learning Containers on Amazon EC2

TUTORIAL

Overview

AWS Deep Learning Containers (DL Containers) are Docker images that come pre-installed with deep learning frameworks. They let you deploy custom machine learning environments quickly, without the complicated process of building and optimizing those environments from scratch.

Using AWS DL Containers, developers and data scientists can quickly add machine learning to their containerized applications deployed on Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes, Amazon Elastic Container Service (Amazon ECS), and Amazon EC2.

In this tutorial, you will train a TensorFlow machine learning model on an Amazon EC2 instance using the AWS Deep Learning Containers.

 AWS experience

Beginner

 Audience

Developers, Data Scientists

 Time to complete

10 minutes

 Cost to complete

Less than $1

 Requires

AWS Account

 Services used

AWS Deep Learning Containers, Amazon EC2, Amazon ECR

 Last updated

April 12, 2023

Implementation

1. Sign up for AWS

You need an AWS account to follow this tutorial. There is no additional charge for using AWS Deep Learning Containers in this tutorial; you pay only for the Amazon EC2 c5.large instance you launch, which will cost less than $1 provided you follow the termination steps at the end of this tutorial.

2. Add permissions for accessing Amazon ECR

AWS Deep Learning Container images are hosted on Amazon Elastic Container Registry (ECR), a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. In this step, you will grant an existing IAM user permissions to access Amazon ECR by attaching the AmazonECS_FullAccess policy and an inline ECR policy.

If you do not have an existing IAM user, refer to the IAM Documentation for more information.

a. Navigate to the IAM console

Open the AWS Management Console in a separate browser tab or window, so you can keep this step-by-step guide open. When the screen loads, enter your user name and password to get started. Then type IAM in the search bar and select IAM to open the service console.

b. Select Users

Select Users from the navigation pane on the left.

c. Add Permissions

Select the IAM user you want to grant access to (a new user you created or an existing one). Then select Add Permissions on the IAM user summary page.

d. Add the ECS Full Access Policy

Select Attach existing policies directly and search for ECS_FullAccess. Select the AmazonECS_FullAccess policy and click through to Review and Add Permissions.

e. Add inline policy

On the IAM user summary page, select Add inline policy.

f. Paste JSON policy

Select the JSON tab and paste the following policy:

{
       "Version": "2012-10-17",
       "Statement": [
              {
                     "Action": "ecr:*",
                     "Effect": "Allow",
                     "Resource": "*"
              }
       ]
}

Save this policy as ‘ECR’ and select Create Policy.
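If you prefer the command line to the console, steps e and f can be approximated with the AWS CLI. The sketch below is not part of the console walkthrough: the file name ecr-policy.json and the helper attach_ecr_policy are illustrative, and it assumes the AWS CLI is installed and configured with credentials that are allowed to edit IAM users.

```shell
#!/bin/sh
# Write the policy from step f to a local file (the file name is illustrative).
POLICY_FILE="${POLICY_FILE:-ecr-policy.json}"
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "ecr:*",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF

# Hypothetical helper: attach the file as an inline policy named "ECR".
# Usage: attach_ecr_policy <iam-user-name>
attach_ecr_policy() {
  aws iam put-user-policy \
    --user-name "$1" \
    --policy-name ECR \
    --policy-document "file://$POLICY_FILE"
}
```

Calling attach_ecr_policy with your IAM user name should have the same effect as saving the inline policy in the console.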

 

3. Launch an AWS Deep Learning Base AMI instance

In this tutorial, we will run AWS Deep Learning Containers on the AWS Deep Learning Base Amazon Machine Image (AMI), which comes pre-packaged with the necessary dependencies, such as NVIDIA drivers, Docker, and nvidia-docker. You can run Deep Learning Containers on any AMI that has these packages.

a. Navigate to the EC2 console

Return to the AWS Management Console home screen, type EC2 in the search bar, and select EC2 to open the service console.

b. Launch an Amazon EC2 instance

In the Amazon EC2 console, select the Launch Instance button.

c. Select the AWS Deep Learning Base AMI

Choose the AWS Marketplace tab on the left, then search for ‘deep learning base ubuntu’. Select Deep Learning Base AMI (Ubuntu). You can also select the Deep Learning Base AMI (Amazon Linux).

d. Select the instance type

Choose an Amazon EC2 instance type. Amazon Elastic Compute Cloud (EC2) is the Amazon Web Service you use to create and run virtual machines in the cloud. AWS calls these virtual machines 'instances'.

For this tutorial, we will use a c5.large instance, but you can choose other instance types, including GPU-based instances (such as G4, G5, P3, and P4).

Select Review and Launch.

e. Launch your instance

Review the details of your instance and select Launch.

f. Create a new private key file

On the next screen, you will be asked to choose an existing key pair or create a new one. A key pair is used to securely access your instance using SSH. AWS stores the public part of the key pair, which is like a house lock. You download and use the private part of the key pair, which is like a house key.

Select Create a new key pair and give it a name. Then select Download Key Pair and store your key in a secure location. If you lose your key, you won't be able to access your instance. If someone else gets access to your key, they will be able to access your instance.

If you have previously created a private key file that you can still access, you can use your existing private key instead by selecting Choose an existing key pair.

g. View instance details

Select the instance ID to view the details of your newly created Amazon EC2 instance in the console.

4. Connect to your instance

In this step, you will connect to your newly launched instance using SSH. The instructions below use a Mac / Linux environment. If you are using Windows, follow step 4 on this tutorial.

a. Find and copy your instance’s public DNS

Under the Description tab, copy your Amazon EC2 instance’s Public DNS (IPv4).

b. Open your command line terminal

In your terminal, use the following commands to change to the directory where your private key (.pem) file is located, restrict the file's permissions, and connect to your instance using SSH.

cd /Users/<your_username>/Downloads/

chmod 0400 <your .pem filename>

ssh -L localhost:8888:localhost:8888 -i <your .pem filename> ubuntu@<your instance DNS>
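The options on the ssh command do different jobs: -i selects the private key, and -L forwards port 8888 on your machine to port 8888 on the instance. SSH refuses private keys whose file permissions are too open, which is why the chmod step comes first. As a minimal sketch, the hypothetical helpers below (the names are mine, not part of any AWS tooling) check the key's permissions before connecting.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the key file has mode 0400
# (readable by its owner, no other permissions), as set by `chmod 0400`.
key_is_private() {
  # The first 10 characters of `ls -l` are the file type and permission bits.
  [ "$(ls -l "$1" | cut -c1-10)" = "-r--------" ]
}

# Hypothetical helper: open the SSH session with the same options as above.
# Usage: connect_to_instance <path-to-pem> <instance-public-dns>
connect_to_instance() {
  key_is_private "$1" || { echo "Run: chmod 0400 $1" >&2; return 1; }
  ssh -L localhost:8888:localhost:8888 -i "$1" "ubuntu@$2"
}
```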

c. Install Docker

Stop any background system update that may be holding the package manager lock, then install Docker.

sudo pkill -f "apt.systemd.daily"
sudo apt install docker.io

5. Log in to Amazon ECR

As described in step 2, AWS Deep Learning Container images are hosted on Amazon Elastic Container Registry (ECR). In this step, you will log in and verify access to Amazon ECR.

a. Configure your EC2 instance with your AWS credentials

You need to provide your AWS Access Key ID and Secret Access Key. If you don’t already have them, you can create an access key for your IAM user in the IAM console.
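One common way to supply these credentials is the AWS CLI's interactive aws configure command, which prompts for the Access Key ID, Secret Access Key, default region, and output format. The sketch below is a simple pre-flight check under that assumption. The helper name have_aws_credentials is hypothetical, and it only looks in the two standard places the CLI reads static credentials from; it does not verify that the keys are valid.

```shell
#!/bin/sh
# Hypothetical pre-flight check: are static credentials present in either
# the standard environment variables or the shared credentials file?
have_aws_credentials() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    return 0
  fi
  # `aws configure` writes the keys to this file by default.
  [ -s "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]
}

# Typical flow on the instance (run interactively):
#   have_aws_credentials || aws configure
#   aws sts get-caller-identity   # shows which account and user the keys belong to
```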

b. Log in to Amazon ECR

You will use the command below to log in to Amazon ECR:

sudo su -
$(aws ecr get-login --region us-east-1 --no-include-email --registry-ids 763104351884)

Note: You need to include the ‘$’ and parentheses in your command; they run the docker login command that aws ecr get-login prints. You will see ‘Login Succeeded’ when this step concludes.
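The aws ecr get-login command exists only in AWS CLI version 1. Version 2 removed it and provides aws ecr get-login-password instead, so if the command above fails with an invalid choice error, the following sketch shows the version 2 form. The helper names are hypothetical; the registry ID and region are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: build the registry host name from an account ID and region.
ecr_registry_host() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Hypothetical helper: log Docker in to the registry using AWS CLI version 2.
# The password is piped to `docker login`, so it never appears on the command line.
# Usage: ecr_login <registry-id> <region>
ecr_login() {
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$(ecr_registry_host "$1" "$2")"
}

# Example:
#   ecr_login 763104351884 us-east-1
```

As with the version 1 command, a successful login prints ‘Login Succeeded’.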

 

6. Run TensorFlow training with Deep Learning Containers

In this step, we will use an AWS Deep Learning Container image for TensorFlow training on CPU instances with Python 3.9.

a. Run AWS Deep Learning Containers

You will now run AWS Deep Learning Container images on your EC2 instance using the command below. This command will automatically pull the Deep Learning Container image if it doesn’t exist locally.

If using a CPU instance:

docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.8.0-cpu-py39-ubuntu20.04-e3

Note: This step may take a few minutes depending on the size of the image. If you are using a GPU instance, use ‘nvidia-docker’ instead of ‘docker’. Once this step completes successfully, you will enter a bash prompt for your container.
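The image tag encodes what is inside the container. As an illustration, the sketch below splits the tag used above on hyphens. The field labels are my own reading of this particular tag, not an official AWS naming specification, and tags for other images may carry additional fields.

```shell
#!/bin/sh
# Hypothetical helper: split a Deep Learning Container tag into labeled fields.
# The body runs in a subshell, so the variables do not leak into the caller.
describe_dlc_tag() (
  # Split on "-" for this one `read` only.
  IFS='-' read -r version device python os extra <<EOF
$1
EOF
  printf 'framework version: %s\n' "$version"
  printf 'device: %s\n' "$device"
  printf 'python: %s\n' "$python"
  printf 'os: %s\n' "$os"
  printf 'extra: %s\n' "$extra"
)

# Example:
#   describe_dlc_tag 2.8.0-cpu-py39-ubuntu20.04-e3
```

For the tag in this tutorial, the device field is cpu and the python field is py39, that is, a CPU-only build with Python 3.9.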

b. Pull an example model to train

We will clone the Keras repository, which includes example Python scripts to train models.

git clone https://github.com/gilinachum/keras

c. Start training

Start training the canonical MNIST CNN model with the following command:

python keras/mnist.py

You have now started training a model with your AWS Deep Learning Container.
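Steps a through c can also be collapsed into one non-interactive run, which is convenient for scripting. This is only a sketch. It assumes the image starts a plain shell with no custom entrypoint (consistent with the bash prompt described in step a) and that git is available inside the container, as step b implies. The helper train_once and the variable names are illustrative.

```shell
#!/bin/sh
# Image used in step a.
IMAGE="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.8.0-cpu-py39-ubuntu20.04-e3"

# The commands from steps b and c, joined so training runs only if the clone succeeds.
TRAIN_CMD="git clone https://github.com/gilinachum/keras && python keras/mnist.py"

# Hypothetical helper: run the training in a throwaway container.
# --rm deletes the container, and anything written inside it, when training ends.
train_once() {
  docker run --rm "$IMAGE" bash -c "$TRAIN_CMD"
}
```

Because --rm discards the container's filesystem, use this form only when the console output is all you need to keep.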

 

7. Terminate Your Resources

In this step, you will terminate the Amazon EC2 instance you created during this tutorial.

Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources can result in charges to your account.

a. Select your running instance

On the Amazon EC2 Console, select Running Instances.

 

b. Terminate your EC2 instance

Select the EC2 instance you created and choose Actions > Instance State > Terminate.

c. Confirm termination

You will be asked to confirm your termination. Select Yes, Terminate.

Note: This process can take several seconds to complete. Once your instance has been terminated, the Instance State will change to terminated on your EC2 Console.
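The same cleanup can be done from the AWS CLI. The sketch below assumes the CLI is configured for the Region where you launched the instance. The helper names are hypothetical, and the instance ID is something you copy from the console.

```shell
#!/bin/sh
# Hypothetical helper: accept only well-formed instance IDs
# ("i-" followed by 8 or 17 lowercase hexadecimal characters).
is_instance_id() {
  printf '%s\n' "$1" | grep -Eq '^i-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# Hypothetical helper: terminate one instance, then block until EC2 reports
# its state as terminated.
# Usage: terminate_instance <instance-id>
terminate_instance() {
  is_instance_id "$1" || { echo "Not an instance ID: $1" >&2; return 1; }
  aws ec2 terminate-instances --instance-ids "$1" &&
    aws ec2 wait instance-terminated --instance-ids "$1"
}
```

Validating the ID first is a small guard against pasting the wrong value into a destructive command.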

 

Conclusion

You have successfully trained an MNIST CNN model with TensorFlow using AWS Deep Learning Containers.

You can use AWS DL Containers for training and inference on CPU and GPU resources on Amazon EC2, Amazon ECS, Amazon EKS, and Kubernetes.

Use these stable deep learning images, which have been optimized for performance and scale on AWS, to build your own custom deep learning environments.
