Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory
Many enterprises use Microsoft Active Directory to manage users, groups, and computers in a network. A frequently asked question is: How can Active Directory users access big data workloads running on Amazon EMR with the same single sign-on (SSO) experience they have when accessing resources in the Active Directory network?
This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.
Walkthrough overview
In this example, you build a solution that allows Active Directory users to seamlessly access Amazon EMR clusters and run big data jobs. Here’s what you need before setting up this solution:
- An AWS account
- An Amazon EC2 key pair
- A possible limit increase for your account (Note: Usually a limit increase will not be necessary. See the AWS Service Limits documentation if you encounter a limit error while building the solution.)
To make it easier for you to get started, I created AWS CloudFormation templates that automatically configure and deploy the solution for you. The following steps and resources are involved in setting up the solution:
- Create and configure an Amazon Virtual Private Cloud (Amazon VPC).
- Launch an Amazon EC2 Windows instance (Active Directory domain controller).
- Create an Amazon EMR security configuration for Kerberos and cross-realm trust.
- Launch an Amazon EMR cluster with Kerberos enabled and a cross-realm trust configuration.
You can use the AWS CloudFormation templates to complete each step individually, or you can deploy the entire solution through a single step.
- To skip the basics and deploy the entire solution through the single-step AWS CloudFormation template, go to the Single-step solution deployment section.
- To set up each component individually, go to the Deploying each component individually section.
Note: If you want to manually create and configure the components for this solution without using AWS CloudFormation, refer to the Amazon EMR cross-realm documentation.
IMPORTANT: The AWS CloudFormation templates used in this post are designed to work only in the us-east-1 (N. Virginia) Region. They are not intended for production use without modification.
Single-step solution deployment
If you don’t want to set up each component individually, you can use the single-step AWS CloudFormation template. The single-step template is a master template that uses nested stacks (additional templates) to launch and configure all the resources for the solution in one go.
To deploy the single-step template into your account, choose Launch Stack:
This takes you to the Create stack wizard in the AWS CloudFormation console. The template is launched in the US East (N. Virginia) Region by default. Do not change to a different Region because the template is designed to work only in us-east-1 (N. Virginia).
On the Select Template page, keep the default URL for the AWS CloudFormation template, and then choose Next.
On the Specify Details page, review the parameters for the template. Provide values for the parameters that require input (for more information, see the parameters table that follows).
The following parameters are available in this template.
Parameter | Default | Description |
Domain Controller name | DC1 | NetBIOS (hostname) name of the Active Directory server. This name can be up to 15 characters long. |
Active Directory domain | example.com | Fully qualified domain name (FQDN) of the forest root domain (for example, example.com). |
Domain NetBIOS name | EXAMPLE | NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long. |
Domain admin user | CrossRealmAdmin | User name for the account that is added as domain administrator. This account is separate from the default administrator account. |
Domain admin password | Requires input | Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols. |
Key pair name | Requires input | Name of an existing key pair, which enables you to connect securely to your instance after it launches. |
Instance type | m4.xlarge | Instance type for the domain controller and the Amazon EMR cluster. |
Allowed IP address | 10.0.0.0/16 | The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH. |
EMR Kerberos realm | EC2.INTERNAL | Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region). |
Trusted AD domain | EXAMPLE.COM | The Active Directory (AD) domain that you want to trust. This is the same as the “Active Directory domain.” However, it must use all uppercase letters (for example, EXAMPLE.COM). |
Cross-realm trust password | Requires input | Password that you want to use for your cross-realm trust. |
Instance count | 2 | The number of instances (core nodes) for the cluster. |
After you specify the template details, choose Next. On the Options page, choose Next again. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box, and then choose Create.
It takes approximately 45 minutes for the deployment to complete. When the stack launch is complete, it will return outputs with information about the resources that were created. Note the outputs and skip to the Managing and testing the solution section. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:
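A minimal sketch of such a command, wrapped in a small helper. The function name `show_stack_outputs` and the stack name in the usage line are placeholders chosen for illustration; use the stack name that you entered in the Create stack wizard.

```shell
# show_stack_outputs STACK_NAME
# Prints the outputs of an AWS CloudFormation stack as a table.
# The function name is illustrative and is not part of the templates.
show_stack_outputs() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --region us-east-1 \
    --query 'Stacks[0].Outputs' \
    --output table
}

# Example (hypothetical stack name):
# show_stack_outputs my-cross-realm-stack
```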
Deploying each component individually
This section describes how to use AWS CloudFormation templates to perform each step separately in the solution.
Create and configure an Amazon VPC
In order for you to establish a cross-realm trust between an Amazon EMR Kerberos realm and an Active Directory domain, your Amazon VPC must meet the following requirements:
- The subnet used for the Amazon EMR cluster must have a CIDR block of fewer than nine digits (for example, 10.0.1.0/24).
- Both DNS resolution and DNS hostnames must be enabled (set to “yes”).
- The Active Directory domain controller must be the DNS server for instances in the Amazon VPC (this is configured in the next step).
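If you bring your own VPC, you can check the two DNS settings from the AWS CLI. The following is a sketch only: the helper name `check_vpc_dns` is hypothetical, and the VPC ID in the usage line is a placeholder.

```shell
# check_vpc_dns VPC_ID
# Prints the enableDnsSupport (DNS resolution) and enableDnsHostnames
# (DNS hostnames) attributes of the VPC. Both should report a value of true.
check_vpc_dns() {
  for attribute in enableDnsSupport enableDnsHostnames; do
    aws ec2 describe-vpc-attribute \
      --vpc-id "$1" \
      --attribute "$attribute" \
      --output json
  done
}

# Example (placeholder VPC ID):
# check_vpc_dns vpc-xxxxxxxx
```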
To use the AWS CloudFormation template to create and configure an Amazon VPC with the prerequisites listed previously, choose Launch Stack:
Note: If you want to create the VPC manually (without using AWS CloudFormation), see Set Up the VPC and Subnet in the Amazon EMR documentation.
Launching this stack creates the following AWS resources:
- Amazon VPC with CIDR block 10.0.0.0/16 (Name: CrossRealmVPC)
- Internet Gateway (Name: CrossRealmGateway)
- Public subnet with CIDR block 10.0.1.0/24 (Name: CrossRealmSubnet)
- Security group allowing inbound access from the VPC’s subnets (Name tag: CrossRealmSecurityGroup)
When the stack launch is complete, it should return outputs similar to the following.
Key | Value example | Description |
SubnetID | subnet-xxxxxxxx | The subnet for the Active Directory domain controller and the EMR cluster. |
SecurityGroup | sg-xxxxxxxx | The security group for the Active Directory domain controller. |
VPCID | vpc-xxxxxxxx | The Active Directory domain controller and EMR cluster will be launched on this VPC. |
Note the outputs because they are used in the next step. You can view the stack outputs on the AWS Management Console or by using the following AWS CLI command:
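One possible form of that command, sketched as a helper that returns a single output value so that you can reuse it in later steps. The function name `stack_output` and the stack name in the usage lines are hypothetical; the output keys (SubnetID, SecurityGroup, VPCID) are the ones listed in the preceding table.

```shell
# stack_output STACK_NAME OUTPUT_KEY
# Prints the value of one stack output, such as SubnetID, SecurityGroup, or VPCID.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --region us-east-1 \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}

# Example (hypothetical stack name):
# SUBNET_ID=$(stack_output my-vpc-stack SubnetID)
# VPC_ID=$(stack_output my-vpc-stack VPCID)
```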
Launch and configure an Active Directory domain controller
In this step, you use an AWS CloudFormation template to automatically launch and configure a new Active Directory domain controller and cross-realm trust.
Note: There are various ways to install and configure an Active Directory domain controller. For details on manually launching and installing a domain controller without AWS CloudFormation, see Step 2: Launch and Install the AD Domain Controller in the Amazon EMR documentation.
In addition to launching and configuring an Active Directory domain controller and cross-realm trust, this AWS CloudFormation template also sets the domain controller as the DNS server (name server) for your Amazon VPC. In other words, the template creates a new DHCP option set for the VPC into which it is deployed, and it sets the private IP address of the domain controller as the name server for that new DHCP option set.
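To make the DHCP change concrete, here is a rough AWS CLI equivalent. It is a sketch under assumptions, not the template's actual implementation: the helper name `set_vpc_dns_server` is hypothetical, and the `domain-name` value is an assumption (the post only states that the name server is set). The usage line passes ec2.internal, the default VPC domain name in us-east-1.

```shell
# set_vpc_dns_server VPC_ID DC_PRIVATE_IP VPC_DOMAIN_NAME
# Creates a DHCP option set whose name server is the domain controller,
# then associates that option set with the VPC.
set_vpc_dns_server() {
  dhcp_options_id=$(aws ec2 create-dhcp-options \
    --dhcp-configurations \
      "Key=domain-name,Values=$3" \
      "Key=domain-name-servers,Values=$2" \
    --query 'DhcpOptions.DhcpOptionsId' \
    --output text) || return 1

  aws ec2 associate-dhcp-options \
    --dhcp-options-id "$dhcp_options_id" \
    --vpc-id "$1"
}

# Example (placeholder VPC ID and domain controller IP):
# set_vpc_dns_server vpc-xxxxxxxx 10.0.1.10 ec2.internal
```

Instances pick up the new name server when they renew their DHCP lease, which is one reason not to apply this to a VPC that already hosts other workloads.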
IMPORTANT: You should not use this template on a production VPC with existing resources like Amazon EC2 instances. When you launch this stack, make sure that you use the new environment and resources (Amazon VPC, subnet, and security group) that were created in the Create and configure an Amazon VPC step.
To launch this stack, choose Launch Stack:
The following table contains information about the parameters available in this template. Review the parameters for the template and provide values for those that require input.
Parameter | Default | Description |
VPC ID | Requires input | Launch the domain controller on this VPC (for example, use the VPC created in the Create and configure an Amazon VPC step). |
Subnet ID | Requires input | Subnet used for the domain controller (for example, use the subnet created in the Create and configure an Amazon VPC step). |
Security group ID | Requires input | Security group (SG) for the domain controller (for example, use the SG created in the Create and configure an Amazon VPC step). |
Domain Controller name | DC1 | NetBIOS name of the Active Directory server (up to 15 characters). |
Active Directory domain | example.com | Fully qualified domain name (FQDN) of the forest root domain (for example, example.com). |
Domain NetBIOS name | EXAMPLE | NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long. |
Domain admin user | CrossRealmAdmin | User name for the account that is added as domain administrator. This account is separate from the default administrator account. |
Domain admin password | Requires input | Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols. |
Key pair name | Requires input | Name of an existing EC2 key pair to enable access to the domain controller instance. |
Instance type | m4.xlarge | Instance type for the domain controller. |
EMR Kerberos realm | EC2.INTERNAL | Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region). |
Cross-realm trust password | Requires input | Password that you want to use for your cross-realm trust. |
It takes 25–30 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next step: Create a security configuration and launch an Amazon EMR cluster with Kerberos enabled.
Create a security configuration and launch an Amazon EMR cluster with Kerberos enabled
To launch a kerberized Amazon EMR cluster, you first need to create a security configuration containing the cross-realm trust configuration. You then specify cluster-specific Kerberos attributes when launching the cluster.
In this step, you use AWS CloudFormation to launch and configure a kerberized Amazon EMR cluster with a cross-realm trust. If you want to manually launch and configure a cluster with Kerberos enabled, see Step 6: Launch a Kerberized EMR Cluster in the Amazon EMR documentation.
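For reference, a security configuration with a cross-realm trust might look like the following sketch. The helper name `kerberos_security_config` and the configuration name in the usage lines are hypothetical. The 24-hour ticket lifetime is an assumed value, and the sketch assumes that the Active Directory domain name resolves to the domain controller, so it is used for both the admin server and the KDC server.

```shell
# kerberos_security_config AD_DOMAIN
# Prints a security configuration JSON document for a cluster-dedicated KDC
# with a cross-realm trust to the given Active Directory domain.
kerberos_security_config() {
  ad_domain=$1
  # The trusted realm is the domain name in uppercase letters.
  ad_realm=$(printf '%s' "$ad_domain" | tr '[:lower:]' '[:upper:]')
  cat <<EOF
{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24,
        "CrossRealmTrustConfiguration": {
          "Realm": "$ad_realm",
          "Domain": "$ad_domain",
          "AdminServer": "$ad_domain",
          "KdcServer": "$ad_domain"
        }
      }
    }
  }
}
EOF
}

# Example (hypothetical configuration name):
# aws emr create-security-configuration \
#   --name CrossRealmSecurityConfig \
#   --security-configuration "$(kerberos_security_config example.com)"
```

The cluster-specific attributes (the cluster realm, the KDC admin password, the cross-realm trust password, and the joiner user credentials) are then supplied at launch, for example through the `--kerberos-attributes` option of `aws emr create-cluster`.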
To create a cross-realm trust security configuration and launch a kerberized Amazon EMR cluster using AWS CloudFormation, choose Launch Stack:
The following table lists and describes the template parameters for deploying a kerberized Amazon EMR cluster and configuring a cross-realm trust.
Parameter | Default | Description |
Active Directory domain | example.com | The Active Directory domain that you want to establish the cross-realm trust with. |
Domain admin user (joiner user) | CrossRealmAdmin | The user name of an Active Directory domain user with privileges to join domains/computers to the Active Directory domain (joiner user). |
Domain admin password | Requires input | Password of the joiner user. |
Cross-realm trust password | Requires input | Password of your cross-realm trust. |
EC2 key pair name | Requires input | Name of an existing key pair, which enables you to connect securely to your cluster after it launches. |
Subnet ID | Requires input | Subnet that you want to use for your Amazon EMR cluster (for example, choose the subnet created in the Create and configure an Amazon VPC step). |
Security group ID | Requires input | Security group that you want to use for your Amazon EMR cluster (for example, choose the security group created in the Create and configure an Amazon VPC step). |
Instance type | m4.xlarge | The instance type that you want to use for the cluster nodes. |
Instance count | 2 | The number of instances (core nodes) for the cluster. |
Allowed IP address | 10.0.0.0/16 | The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH. |
EMR Kerberos realm | EC2.INTERNAL | Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region). |
Trusted AD domain | EXAMPLE.COM | The Active Directory domain that you want to trust. This name is the same as the “Active Directory domain.” However, it must use all uppercase letters (for example, EXAMPLE.COM). |
It takes 10–15 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next section: Managing and testing the solution.
Managing and testing the solution
Now that you’ve configured and built the solution, it’s time to test it by connecting to a cluster using Active Directory credentials.
SSH to a cluster using Active Directory credentials (single sign-on)
If you launched the kerberized Amazon EMR cluster with the AWS CloudFormation templates and added your client IP range to the Allowed IP address parameter, you should be able to connect to the cluster using an SSH client and your Active Directory user credentials. If you have trouble connecting to the cluster using SSH, check the cluster’s security group to make sure that it allows inbound SSH connections (TCP port 22) from your client’s IP address (source).
The following steps assume that you’re using a client such as OpenSSH. If you’re using a different SSH application (for example, PuTTY), consult the application-specific documentation.
Note: Because the cluster was launched with a cross-realm trust configuration, you don’t need to use a private key (.pem file) when you connect to it as a domain user using SSH.
To connect to your Amazon EMR cluster as an Active Directory user using SSH, run the following command. Replace ad_user with the domain admin user that you created while setting up the domain controller and replace master_node_URL with the cluster’s URL (see the stack’s outputs to find this information):
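With OpenSSH, the connection can be sketched as follows. The helper name `connect_as_ad_user` is hypothetical; `ad_user` and `master_node_URL` are the placeholders described above.

```shell
# connect_as_ad_user AD_USER MASTER_NODE_URL
# Opens an SSH session on the master node as the given Active Directory user.
# SSH prompts for the user's Active Directory password.
connect_as_ad_user() {
  ssh -l "$1" "$2"
}

# Example:
# connect_as_ad_user ad_user master_node_URL
```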
If your SSH client is configured to use a key as the preferred authentication method, the login might fail. If that’s the case, you can add the following options to your SSH command to force the SSH connection to use password authentication:
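One way to do this with OpenSSH is to set the `PreferredAuthentications` and `PubkeyAuthentication` options on the command line. The helper name `connect_with_password` is hypothetical.

```shell
# connect_with_password AD_USER MASTER_NODE_URL
# Connects as the given Active Directory user and forces password
# authentication by disabling public key authentication for this session.
connect_with_password() {
  ssh -o PreferredAuthentications=password \
      -o PubkeyAuthentication=no \
      -l "$1" "$2"
}

# Example:
# connect_with_password ad_user master_node_URL
```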
After a domain user connects to the cluster using SSH, if this is the first time that the user is connecting, a local home directory is created for that user. In addition to creating a local home directory, if you used the create-hdfs-home-ba.sh bootstrap action when launching the cluster (done by default if you used the AWS CloudFormation template to launch a kerberized cluster), an HDFS user home directory is also automatically created.
Note: If you manually launched the cluster and did not use the create-hdfs-home-ba.sh bootstrap action, then you’ll need to manually create HDFS user home directories for your users.
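Creating a home directory by hand could look like the following sketch. It assumes that you run it on a cluster node while already holding a Kerberos ticket for an HDFS superuser, and that home directories live under /user. The helper name `create_hdfs_home` is hypothetical, and this is not necessarily what the bootstrap action does internally.

```shell
# create_hdfs_home USERNAME
# Creates /user/USERNAME in HDFS and makes the user its owner.
# Must be run by an authenticated HDFS superuser.
create_hdfs_home() {
  hdfs dfs -mkdir -p "/user/$1" &&
  hdfs dfs -chown "$1:$1" "/user/$1"
}

# Example:
# create_hdfs_home ad_user
```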
When you connect to the cluster using SSH for the first time (as a domain user), you should see the following messages if the HDFS home directory for your domain user was successfully created:
Running jobs on a kerberized Amazon EMR cluster
To run a job on a kerberized cluster, the user submitting the job must first be authenticated. If you followed the previous section to connect to your cluster as an Active Directory user using SSH, the user should be authenticated automatically.
If running the klist command returns a “No credentials cache found” message, it means that the user is not authenticated (the user doesn’t have a Kerberos ticket). You can re-authenticate a user at any time by running the following command (be sure to use all uppercase letters for the Active Directory domain):
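A minimal sketch of that command, wrapped in a hypothetical helper named `reauthenticate` that converts the domain to uppercase for you and then lists the resulting tickets:

```shell
# reauthenticate AD_USER AD_DOMAIN
# Requests a new Kerberos ticket for the user (kinit prompts for the password)
# and then lists the cached tickets with klist.
reauthenticate() {
  # Kerberos realm names are case sensitive; the AD realm must be uppercase.
  realm=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
  kinit "$1@$realm" && klist
}

# Example, equivalent to running: kinit ad_user@EXAMPLE.COM
# reauthenticate ad_user example.com
```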
When the user is authenticated, they can submit jobs just like they would on a non-kerberized cluster.
Auditing jobs
Another advantage that Kerberos can provide is that you can easily tell which user ran a particular job. For example, connect (using SSH) to a kerberized cluster with an Active Directory user, and submit the SparkPi sample application:
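One way to submit SparkPi is with `spark-submit`, sketched below. The path to the examples JAR is an assumption about where Spark is installed on the cluster, so adjust it if your release differs. The helper name `run_spark_pi` is hypothetical.

```shell
# run_spark_pi [PARTITIONS]
# Submits the SparkPi sample application to YARN in cluster mode.
# PARTITIONS defaults to 10. The JAR path below is an assumed location.
run_spark_pi() {
  spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    /usr/lib/spark/examples/jars/spark-examples.jar "${1:-10}"
}

# Example:
# run_spark_pi 10
```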
After running the SparkPi application, go to the Amazon EMR console and choose your cluster. Then choose the Application history tab. There you can see information about the application, including the user that submitted the job.
Common issues
Although it would be hard to cover every possible Kerberos issue, this section covers some of the more common issues that might occur and ways to fix them.
Issue 1: You can successfully connect and get authenticated on a cluster. However, whenever you try running a job, it fails with an error similar to the following:
org.apache.hadoop.security.AccessControlException: Permission denied
Solution: Make sure that an HDFS home directory for the user was created and that it has the right permissions.
Issue 2: You can successfully connect to the cluster, but you can’t run any Hadoop or HDFS commands.
Solution: Use the klist command to confirm whether the user is authenticated and has a valid Kerberos ticket. Use the kinit command to re-authenticate a user.
Issue 3: You can’t connect (using SSH) to the cluster using Active Directory user credentials, but you can manually authenticate the user with kinit.
Solution: Make sure that the Active Directory domain controller is the DNS server (name server) for the cluster nodes.
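The checks behind these three solutions can be run together from a cluster node. The following is a sketch: the helper name `diagnose_user` is hypothetical, and it assumes that HDFS home directories live under /user.

```shell
# diagnose_user USERNAME
# Runs one quick check for each of the common issues described above.
diagnose_user() {
  echo "== Kerberos tickets (Issue 2) =="
  klist
  echo "== HDFS home directory and permissions (Issue 1) =="
  hdfs dfs -ls -d "/user/$1"
  echo "== DNS servers used by this node (Issue 3) =="
  cat /etc/resolv.conf
}

# Example:
# diagnose_user ad_user
```

The name server listed in the last section should be the private IP address of the Active Directory domain controller.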
Cleaning up
After completing and testing this solution, remember to clean up the resources. If you used the AWS CloudFormation templates to create the resources, then use the AWS CloudFormation console or AWS CLI/SDK to delete the stacks. Deleting a stack also deletes the resources created by that stack.
If one of your stacks does not delete, make sure that there are no dependencies on the resources created by that stack. For example, if you deployed an Amazon VPC using AWS CloudFormation and then deployed a domain controller into that VPC using a different AWS CloudFormation stack, you must first delete the domain controller stack before the VPC stack can be deleted.
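From the AWS CLI, deleting the stacks in dependency order can be sketched as follows. The helper name `delete_stack_and_wait` and the stack names in the usage lines are hypothetical; substitute the names that you chose when creating the stacks.

```shell
# delete_stack_and_wait STACK_NAME
# Starts deleting the stack and blocks until the deletion has completed.
delete_stack_and_wait() {
  aws cloudformation delete-stack --stack-name "$1" --region us-east-1 &&
  aws cloudformation wait stack-delete-complete --stack-name "$1" --region us-east-1
}

# Example (hypothetical stack names), deleting dependent stacks first:
# delete_stack_and_wait my-emr-stack &&
# delete_stack_and_wait my-domain-controller-stack &&
# delete_stack_and_wait my-vpc-stack
```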
Summary
The ability to authenticate users and services with Kerberos not only allows you to secure your big data applications, but it also enables you to easily integrate Amazon EMR clusters with an Active Directory environment. This post showed how you can use Kerberos on Amazon EMR to create a single sign-on solution where Active Directory domain users can seamlessly access Amazon EMR clusters and run big data applications. We also showed how you can use AWS CloudFormation to automate the deployment of this solution.
About the Author
Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.