AWS for Industries

Unlock Powerful Genomic Insights with AWS HealthOmics Analytics and Amazon EMR

Analyzing large-scale genomic variant data just got easier with the latest release of Amazon EMR and its integration with AWS HealthOmics. HealthOmics Analytics provides tabular access to genetic variant data sets and annotations, empowering researchers and data scientists to uncover valuable insights through Amazon Athena queries and now Amazon EMR jobs.

While Amazon Athena excels at quick, interactive querying of data using SQL, Amazon EMR offers a more flexible big data platform for processing vast amounts of data. EMR leverages popular frameworks like Apache Spark, allowing you to go beyond the interactive model of Athena and gain greater flexibility and control over your compute environment and analytics workflow. This enables you to tackle complex, multi-stage data processing workflows via cost-effective EMR jobs. By combining HealthOmics Analytics Stores with the power of EMR, you can unlock a wide range of use cases, including genotype-phenotype association analysis, population-scale variant analysis, and integration with a diverse ecosystem of bioinformatics tools such as Hail, ADAM, GATK, and more. HealthOmics leverages AWS Lake Formation to ensure secure and centrally managed governance of your genomic data, giving you complete control over who can access it, including via EMR jobs. This blog post will guide you through the initial configuration and setup of EMR to help you get started with your first genomic data query.

At a high level

A high-level architecture diagram showing the steps taken when a Spark job is run on EMR to query a HealthOmics Variant Store.

Today, we’ll be running a Spark job on EMR (1). EMR integrates with Lake Formation to secure credentials and authenticate access to your genomic data (2). We will write our queries against our HealthOmics Analytics Store in the Glue Data Catalog (3), which pulls the secured data from HealthOmics (4).

Initial setup

This blog will assume you’ve created a HealthOmics Variant and/or Annotation store and have imported genomic and/or annotation data into it already. If you haven’t, see the HealthOmics Analytics Documentation to get started.

There are a few initial steps required to enable your EMR cluster to access HealthOmics Analytics data through AWS Lake Formation. You will need a role permissive enough to interact with EMR, IAM, and Lake Formation.

Configuring Permissions in IAM

The following steps modify the default EMR roles. In your own workflow you can use any roles you’d like, as long as they are given the correct IAM and Lake Formation permissions.

You can create the default EMR roles using the AWS command line interface:

aws emr create-default-roles --region <AWS_REGION>

This creates two IAM roles, EMR_EC2_DefaultRole and EMR_DefaultRole, which will be used by EMR.

EMR_EC2_DefaultRole

This role is assumed by all of the EC2 instances in the EMR cluster. Use it to grant the permissions the instances need to interact with other AWS services and resources as part of their data processing and management tasks. Aside from the default AmazonElasticMapReduceforEC2Role managed policy it should already have, we will add the following inline policy to this role:

Please replace <AWS_ACCOUNT> with your AWS Account Id

Please replace <AWS_REGION> with the appropriate AWS Region

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LakeFormationPermission",
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*"
        },
        {
            "Sid": "DefaultDBPermissions",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateDatabase"
            ],
            "Resource": [
                "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:catalog",
                "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:database/default"
            ]
        }
    ]
}

Breaking it down:

The LakeFormationPermission statement grants the role the ability to request temporary credentials from Lake Formation to access the underlying data.

The DefaultDBPermissions statement gives your cluster the ability to initialize with the default database. In EMR’s catalog integration, EMR always checks whether the default database exists. This statement allows the role to create the default database if necessary. If a default database already exists in Lake Formation, you can simply grant EMR_EC2_DefaultRole Describe permission on it and leave out this statement.
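The inline policy above can also be generated and attached from a script. The following is a minimal sketch rather than a definitive implementation: it assumes boto3 is installed and credentials are configured, and the function names and the policy name used in the usage note are hypothetical.

```python
import json


def build_emr_ec2_inline_policy(region: str, account_id: str) -> dict:
    """Return the inline policy from this section with the placeholders filled in."""
    glue_prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LakeFormationPermission",
                "Effect": "Allow",
                "Action": "lakeformation:GetDataAccess",
                "Resource": "*",
            },
            {
                "Sid": "DefaultDBPermissions",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:CreateDatabase",
                ],
                "Resource": [
                    f"{glue_prefix}:catalog",
                    f"{glue_prefix}:database/default",
                ],
            },
        ],
    }


def attach_inline_policy(role_name: str, policy_name: str, policy: dict) -> None:
    """Attach the policy to the role as an inline policy (this calls AWS)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )
```

For example, `attach_inline_policy("EMR_EC2_DefaultRole", "HealthOmicsLakeFormationAccess", build_emr_ec2_inline_policy("us-east-1", "111122223333"))`, where the policy name, Region and account ID are placeholders.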

EMR_DefaultRole

This role is assumed by the Amazon EMR service to help manage resources and perform various actions on your behalf during the lifecycle of an EMR cluster. It should already be configured with the permissions needed for this blog.

Lake Formation

Create a data lake admin

If you have not already done so, you will need to create a data lake administrator. This will allow you to create the resource links and grants required to access your data from EMR.

In the Lake Formation Console:

  1. In the left-side navigation panel, expand Administration
  2. Under Administration, choose Administrative roles and tasks
  3. On the page you will see a Data lake administrators box; choose Add
  4. Add an IAM role you can assume (e.g. the role you are currently using in the console) and Confirm
  5. You should now see that IAM role in the Data lake administrators list
  6. If that role is different than the one you are currently using, log into the console with that role

Here I am making a role named Admin a data lake administrator, but you can use any role you’d like.

Adding a LakeFormation Admin via the AWS Lake Formation console page. Access type: Data lake administrator is selected and assigned the “Admin” role.
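If you prefer to script this step, the sketch below adds a data lake administrator through the Lake Formation API. It assumes boto3 with configured credentials; the function names are hypothetical. Because `put_data_lake_settings` writes the whole settings object, the sketch reads the current settings first and only appends the new principal.

```python
import copy


def with_data_lake_admin(settings: dict, principal_arn: str) -> dict:
    """Return a copy of the settings with principal_arn listed as an admin."""
    updated = copy.deepcopy(settings)
    admins = updated.setdefault("DataLakeAdmins", [])
    already_admin = any(
        admin.get("DataLakePrincipalIdentifier") == principal_arn
        for admin in admins
    )
    if not already_admin:
        admins.append({"DataLakePrincipalIdentifier": principal_arn})
    return updated


def add_data_lake_admin(principal_arn: str) -> None:
    """Read the current settings, add the admin, write them back (calls AWS)."""
    import boto3  # imported lazily so the helper above works without boto3

    lakeformation = boto3.client("lakeformation")
    current = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    lakeformation.put_data_lake_settings(
        DataLakeSettings=with_data_lake_admin(current, principal_arn)
    )
```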

Change the default permissions model

This is necessary to let Lake Formation handle permissions on newly created resource links. Disable the use of IAM access controls for new databases and tables, and revoke the IAMAllowedPrincipals permission for database creators.

  1. In the left-side navigation panel, expand Administration
  2. Under Administration, choose Data Catalog settings
  3. On the page you will see a Default permissions for newly created databases and tables section
  4. Uncheck and Save the following options:
    1. Use only IAM access control for new databases
    2. Use only IAM access control for new tables in new databases

LakeFormation Data Catalog Settings web form. Check boxes are unchecked.

Application integration for full table access

In the Lake Formation Console:

  1. In the left-side navigation, expand Administration
  2. Under Administration, choose Application integration settings.
  3. On the Application integration settings page, choose the checkbox to Allow external engines to access data in Amazon S3 locations with full table access

This will authorize the EMR query engine to request credentials to query against the underlying data.

LakeFormation Application Integration Settings web form. Allow external engines to access data in Amazon S3 locations with full table access checkbox is checked. Other boxes are unchecked.
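The two console changes above (clearing the default IAM-only permissions and allowing full table access for external engines) can be sketched against the Lake Formation settings API as follows. This assumes that the `CreateDatabaseDefaultPermissions`, `CreateTableDefaultPermissions` and `AllowFullTableExternalDataAccess` fields of the data lake settings are the API counterparts of those checkboxes; the function names are hypothetical.

```python
import copy


def with_emr_access_settings(settings: dict) -> dict:
    """Return a copy of the settings with the changes described in this section."""
    updated = copy.deepcopy(settings)
    # Assumed counterpart of unchecking "Use only IAM access control ..."
    # for new databases and for new tables in new databases.
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    # Assumed counterpart of "Allow external engines to access data in
    # Amazon S3 locations with full table access".
    updated["AllowFullTableExternalDataAccess"] = True
    return updated


def apply_emr_access_settings() -> None:
    """Read-modify-write so existing admins and other settings are kept (calls AWS)."""
    import boto3  # imported lazily so the helper above works without boto3

    lakeformation = boto3.client("lakeformation")
    current = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    lakeformation.put_data_lake_settings(
        DataLakeSettings=with_emr_access_settings(current)
    )
```

Reading the settings before writing them matters here, since writing a partial settings object could drop the data lake administrators configured earlier.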

Take Note of the HealthOmics Analytics Store Amazon S3 Path

To find your HealthOmics Analytics Store Amazon S3 Path:

  1. In the left-side navigation panel, expand Data Catalog
  2. Under Data Catalog, choose Databases
  3. On the page you will see a Databases box, look for your HealthOmics Analytics store
  4. Copy the S3 path listed under the Amazon S3 Path column. Save this S3 path; we will use it later in the EMR cluster!

LakeFormation Databases Console filtered by the database name and showing the Amazon S3 path for the database.
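The path can likely be looked up programmatically as well. The sketch below is an assumption-heavy illustration: it assumes the console's Amazon S3 Path column corresponds to the `LocationUri` field returned by the Glue `get_databases` API, and that the HealthOmics store appears when shared databases are included. The function names are hypothetical.

```python
from typing import Optional


def find_location_uri(databases: list, store_name: str) -> Optional[str]:
    """Return the LocationUri of the database named store_name, if present."""
    for database in databases:
        if database.get("Name") == store_name:
            return database.get("LocationUri")
    return None


def list_all_databases() -> list:
    """List local and shared databases in the Glue Data Catalog (calls AWS)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    databases = []
    paginator = glue.get_paginator("get_databases")
    for page in paginator.paginate(ResourceShareType="ALL"):
        databases.extend(page["DatabaseList"])
    return databases
```

If the lookup returns nothing for your store, fall back to the console steps above.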

Create a Data Lake Resource Link

Now that we’ve configured the Lake Formation settings and created our Lake Formation Admin, we can create a resource link to the HealthOmics Analytics store and grant the EMR_EC2_DefaultRole role permission to access it in EMR. If you previously used Amazon Athena to query your HealthOmics Analytics Store, the process is the same.

To learn more, see how resource links work in Lake Formation.

To create a resource link:

  1. Ensure you’re using the Lake Formation Admin Role previously set
  2. In the left-side navigation panel, expand Data Catalog
  3. Under Data Catalog, choose Databases
  4. On the page you will see a Databases box, choose your HealthOmics Analytics store
  5. Under Actions, choose Create resource link
  6. Choose a Resource Link Name – this is the name of the database we will use in EMR
    1. Note: This name must be a compliant SQL database name. To avoid having to escape the name in your queries, we suggest using only lowercase letters and underscores
  7. Leave everything else to its pre-set value

Creating a Resource Link in LakeFormation web form with resource link name, database region, shared database and shared database owner ID fields filled with example values.
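A resource link can also be created through the Glue API by creating a database whose `TargetDatabase` points at the shared store. The sketch below assumes boto3 with configured credentials; the function names are hypothetical, and the name check simply encodes the lowercase-and-underscore suggestion from the note above.

```python
import re


def is_simple_sql_name(name: str) -> bool:
    """True if the name uses only lowercase letters and underscores."""
    return re.fullmatch(r"[a-z_]+", name) is not None


def build_resource_link_input(
    link_name: str, shared_database: str, owner_account_id: str
) -> dict:
    """Build the DatabaseInput for a resource link to a shared database."""
    if not is_simple_sql_name(link_name):
        raise ValueError(
            f"{link_name!r} would need escaping in queries; "
            "use only lowercase letters and underscores"
        )
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": owner_account_id,  # the shared database owner ID
            "DatabaseName": shared_database,
        },
    }


def create_resource_link(
    link_name: str, shared_database: str, owner_account_id: str
) -> None:
    """Create the resource link in the caller's Data Catalog (calls AWS)."""
    import boto3  # imported lazily so the helpers above work without boto3

    boto3.client("glue").create_database(
        DatabaseInput=build_resource_link_input(
            link_name, shared_database, owner_account_id
        )
    )
```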

Grant EMR_EC2_DefaultRole Describe permissions on the Resource Link Database

Recall that when we created the default EMR roles, we set the lakeformation:GetDataAccess permission on EMR_EC2_DefaultRole. Since we will use this role to access our data through Lake Formation, we’ll need to grant it Describe permission on the Resource Link.

To grant Describe on the Resource Link:

  1. Ensure you’re using the Lake Formation Admin Role previously set
  2. In the left-side navigation panel, expand Data Catalog
  3. Under Data Catalog, choose Databases
  4. On the page you will see a Databases box; choose the Resource Link you created in the previous step
  5. Under Actions, choose Grant
  6. Choose the EMR_EC2_DefaultRole to add
  7. Under Resource link permissions, choose Describe
  8. Select Grant to save

Granting LakeFormation permissions on a database web form as it appears after adding the values as described in the instructions.

Grant EMR_EC2_DefaultRole Select and Describe permissions on the Resource Link Table

Similar to the previous step, we are going to grant another set of permissions on the resource link. This time we will grant permissions specifically on the HealthOmics Analytics store table our Resource Link refers to.

To grant Select and Describe on the Resource Link table:

  1. Ensure you’re using the Lake Formation Admin Role previously set
  2. In the left-side navigation panel, expand Data Catalog
  3. Under Data Catalog, choose Databases
  4. On the page you will see a Databases box; choose the Resource Link you created
  5. Under Actions, choose Grant
  6. Choose the EMR_EC2_DefaultRole to add
  7. Under Tables, choose All Tables or the table that matches the name of your HealthOmics Analytics Store or Variant Cohort
  8. Under Table permissions, choose Select and Describe

Granting LakeFormation permissions on a table web form as it appears after adding the values as described in the instructions.
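The two grants described above (Describe on the resource link, then Select and Describe on the store's tables) can be sketched as `grant_permissions` requests. This is an illustration under stated assumptions: it assumes the table grant has to name the shared target database and its owner's catalog ID rather than the resource link itself, and it grants on all tables through a wildcard. The function names are hypothetical.

```python
def describe_link_grant(role_arn: str, link_name: str) -> dict:
    """Request granting Describe on the resource link database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": link_name}},
        "Permissions": ["DESCRIBE"],
    }


def table_grant(role_arn: str, owner_account_id: str, shared_database: str) -> dict:
    """Request granting Select and Describe on all tables of the shared database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,    # assumption: owner's catalog
                "DatabaseName": shared_database,  # assumption: target, not the link
                "TableWildcard": {},
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def apply_grants(grants: list) -> None:
    """Send each grant request to Lake Formation (calls AWS)."""
    import boto3  # imported lazily so the builders above work without boto3

    lakeformation = boto3.client("lakeformation")
    for grant in grants:
        lakeformation.grant_permissions(**grant)
```

To restrict access to a single table, replace the `TableWildcard` entry with a `Name` key holding the table name.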

These are all the permissions required to access the data in EMR. You should be able to verify the permissions for EMR_EC2_DefaultRole on the Permissions page.

At the minimum, you should have:

  1. Describe permission on the Resource Link Database
  2. Describe permission on the HealthOmics Analytics Store Table
  3. Select permission on all the columns in the HealthOmics Analytics Store Table

Tabular display of LakeFormation permissions filtered by the EMR_EC2_DefaultRole
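The minimum permission list above can also be checked from a script. The sketch below inspects entries shaped like the `PrincipalResourcePermissions` items returned by the Lake Formation `list_permissions` API; the exact shape of those entries for a HealthOmics store is an assumption, and the function names are hypothetical.

```python
REQUIRED = ["DESCRIBE on resource link", "DESCRIBE on table", "SELECT on table"]


def missing_permissions(entries: list, link_name: str, table_database: str) -> list:
    """Return the required grants that do not appear in entries."""
    found = set()
    for entry in entries:
        resource = entry.get("Resource", {})
        permissions = set(entry.get("Permissions", []))
        if resource.get("Database", {}).get("Name") == link_name:
            if "DESCRIBE" in permissions:
                found.add("DESCRIBE on resource link")
        # Select grants may be reported on a Table or a TableWithColumns resource.
        for kind in ("Table", "TableWithColumns"):
            if resource.get(kind, {}).get("DatabaseName") == table_database:
                for permission in ("DESCRIBE", "SELECT"):
                    if permission in permissions:
                        found.add(f"{permission} on table")
    return [item for item in REQUIRED if item not in found]


def list_role_permissions(role_arn: str) -> list:
    """Fetch all Lake Formation permissions for the role (calls AWS)."""
    import boto3  # imported lazily so the check above works without boto3

    lakeformation = boto3.client("lakeformation")
    entries, token = [], None
    while True:
        kwargs = {"Principal": {"DataLakePrincipalIdentifier": role_arn}}
        if token:
            kwargs["NextToken"] = token
        page = lakeformation.list_permissions(**kwargs)
        entries.extend(page["PrincipalResourcePermissions"])
        token = page.get("NextToken")
        if not token:
            return entries
```

An empty list from `missing_permissions` suggests the three grants are in place; the Permissions page remains the authoritative view.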

EMR Cluster Setup

We are now ready to set up EMR and query HealthOmics Analytics data.

Create an EMR Cluster

There are several ways to create and configure an EMR cluster. Here we will create one with the smallest footprint required to access our HealthOmics Analytics data.

At the minimum, we will need:

  1. Amazon EMR Release emr-6.13.0 or greater
  2. Spark 3.4.1 or greater
  3. For AWS Glue Data Catalog Settings, ensure Use For Spark table Metadata is enabled
  4. Configuration for iceberg-defaults, with iceberg.enabled set to true

To create a cluster:

  1. In the left-side navigation panel, expand EMR on EC2 and choose Clusters
  2. On the page you will see a Clusters box, choose Create Cluster
  3. Name your Cluster (e.g. omics-analytics-cluster)
  4. Select an Amazon EMR Release greater than or equal to emr-6.13.0
  5. Under Application Bundle, choose Custom
    1. Choose Spark 3.4.1 or greater
    2. Under AWS Glue Data Catalog settings, choose Use for Spark table metadata

  6. Under Software Settings, set the iceberg-defaults config
    [
       {
          "Classification": "iceberg-defaults",
          "Properties": {
             "iceberg.enabled": "true"
          }
       }
    ]
  7. Make sure to set an EC2 Key pair so you can SSH into the cluster
  8. Under Identity and Access Management (IAM) roles, set the default EMR roles we created earlier
    1. Under Amazon EMR Service Role, choose EMR_DefaultRole
    2. Under EC2 instance profile for Amazon EMR, choose EMR_EC2_DefaultRole. Recall that EMR_EC2_DefaultRole was the role we granted database and table access to in Lake Formation earlier
  9. Click Create Cluster
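The console steps above roughly correspond to a single `run_job_flow` request. The following is a sketch under stated assumptions, not a recommended production configuration: the single m5.xlarge primary node is an arbitrary choice, the `spark-hive-site` classification is assumed to be the counterpart of "Use for Spark table metadata", and the function names are hypothetical.

```python
def _version(release_label: str) -> tuple:
    """Turn 'emr-6.13.0' into (6, 13, 0) so releases compare numerically."""
    return tuple(int(part) for part in release_label.replace("emr-", "", 1).split("."))


def release_at_least(release_label: str, minimum: str = "emr-6.13.0") -> bool:
    return _version(release_label) >= _version(minimum)


def build_cluster_request(
    name: str, key_name: str, release_label: str = "emr-6.13.0"
) -> dict:
    """Build run_job_flow arguments matching the minimum requirements above."""
    if not release_at_least(release_label):
        raise ValueError(f"{release_label} is older than emr-6.13.0")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Configurations": [
            {
                "Classification": "iceberg-defaults",
                "Properties": {"iceberg.enabled": "true"},
            },
            {
                # Assumed counterpart of "Use for Spark table metadata".
                "Classification": "spark-hive-site",
                "Properties": {
                    "hive.metastore.client.factory.class": (
                        "com.amazonaws.glue.catalog.metastore."
                        "AWSGlueDataCatalogHiveClientFactory"
                    )
                },
            },
        ],
        "Instances": {
            "InstanceGroups": [
                # Assumption: one small primary node is enough for a first query.
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
            ],
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def create_cluster(name: str, key_name: str) -> str:
    """Start the cluster and return its ID (calls AWS)."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("emr").run_job_flow(**build_cluster_request(name, key_name))
    return response["JobFlowId"]
```

Comparing release labels as number tuples rather than strings avoids treating emr-6.9.0 as newer than emr-6.13.0.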

SSH into your EMR Cluster

Here are a few helpful reminders when SSHing into EC2:

  1. Ensure the EC2 Security Group attached to your EC2 Instance allows for inbound SSH connections (port 22) from your computer’s IP Address
  2. Set read-only permissions on the pem file associated with your SSH Key
    1. e.g. chmod 400 emrKey.pem
  3. You can grab the SSH command from the Cluster’s detail page

A set of three instructions for connecting to EMR via SSH displayed when the Cluster’s detail page is selected.

You should now be able to SSH into the Primary Node of your EMR Cluster.

EEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR    
E::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R  
EE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R
  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R
  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R
  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R
  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR  
  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R  
  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R
  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R
EE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R
E::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR

[hadoop@ip ~]$

Querying Your HealthOmics Analytics Store

You can use any of the Spark commands available to you. Here we’ll make a query using the SparkSQL shell. In all cases the HealthOmics Analytics table is read-only; however, you may create derivative tables and store them in your account.

  • Replace <WAREHOUSE_LOCATION> with the full HealthOmics Analytics Store Amazon S3 Path
    • This is the HealthOmics Analytics Store Amazon S3 Path retrieved earlier in this blog. Use it now, or refer back to the Lake Formation section on how to retrieve it
  • Replace <AWS_ACCOUNT> with your AWS Account ID
  • Replace <AWS_REGION> with the region your HealthOmics Analytics Store and EMR cluster reside in.

spark-sql \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=<WAREHOUSE_LOCATION> \
    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.my_catalog.glue.lakeformation-enabled=true \
    --conf spark.sql.catalog.my_catalog.glue.account-id=<AWS_ACCOUNT> \
    --conf spark.sql.catalog.my_catalog.client.region=<AWS_REGION> \
    --conf spark.sql.catalog.my_catalog.client.factory=org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory
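Filling in three placeholders across nine `--conf` flags by hand is error-prone, so a small helper that assembles the command above can be useful. This is a minimal sketch; the function names are hypothetical and the configuration keys are copied from the command shown above.

```python
import shlex


def build_spark_sql_command(
    warehouse: str, account_id: str, region: str, catalog: str = "my_catalog"
) -> list:
    """Return the spark-sql invocation above as an argument list."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.glue.lakeformation-enabled": "true",
        f"{prefix}.glue.account-id": account_id,
        f"{prefix}.client.region": region,
        f"{prefix}.client.factory": (
            "org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory"
        ),
    }
    command = ["spark-sql"]
    for key, value in conf.items():
        command += ["--conf", f"{key}={value}"]
    return command


def render_command(command: list) -> str:
    """Quote the argument list so it can be pasted into a shell."""
    return shlex.join(command)
```

Calling `print(render_command(build_spark_sql_command("s3://example-bucket/path/", "111122223333", "us-east-1")))` prints a one-line command with placeholder values. The same key and value pairs could presumably be passed to `SparkSession.builder.config` in a PySpark job instead of the shell.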

You should now be able to run any of the following queries in the SparkSQL shell.

  • Use the catalog name you defined in the configuration; this blog uses my_catalog
  • Replace <MY_RESOURCE_LINK> with the name of your Database Resource Link
  • Replace <TABLE_NAME> with your HealthOmics Analytics Store name or Variant Cohort you’d like to query

Describe the Table

spark-sql (default)> DESCRIBE my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME>;

Result

importjobid             string                                      
contigname              string                                      
start                   bigint                                      
end                     bigint                                      
names                   array<string>                              
referenceallele         string                                      
alternatealleles        array<string>                              
qual                    double                                      
filters                 array<string>                              
splitfrommultiallelic    boolean                                    
attributes              map<string,string>                          
phased                  boolean                                    
calls                   array<int>                                  
genotypelikelihoods     array<double>                              
phredlikelihoods        array<int>                                  
alleledepths            array<int>                                  
conditionalquality      int                                        
spl                     array<int>                                  
depth                   int                                        
ps                      int                                        
sampleid                string                                      
information             map<string,string>                          
annotations             struct<vep:array<struct<allele:string,consequence:array<string>,impact:string,symbol:string,gene:string,feature_type:string,feature:string,biotype:string,exon:struct<rank:string,total:string>,intron:struct<rank:string,total:string>,hgvsc:string,hgvsp:string,cdna_position:string,cds_position:string,protein_position:string,amino_acids:struct<reference:string,variant:string>,codons:struct<reference:string,variant:string>,existing_variation:array<string>,distance:string,strand:string,flags:array<string>,symbol_source:string,hgnc_id:string,extras:map<string,string>>>>
Time taken: 1.291 seconds, Fetched 23 row(s)

Count the number of variants where the Allele Frequency (AF score) is > 0.5

spark-sql (default)> SELECT count(*) FROM my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME> where attributes['AF'] > 0.5;

Result

32
Time taken: 8.763 seconds, Fetched 1 row(s)
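The DESCRIBE output shows attributes as map<string,string>, so the AF value is stored as a string and the comparison above relies on an implicit conversion. The sketch below builds the same count query with an explicit cast, which makes the numeric comparison intentional. The function names are hypothetical, and a multi-valued AF entry would not parse as a single number, so treat this as a starting point only.

```python
def qualified_table(catalog: str, resource_link: str, table: str) -> str:
    """Return a backtick-quoted catalog.database.table identifier."""
    parts = (catalog, resource_link, table)
    for part in parts:
        if "`" in part or not part:
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


def count_variants_above_af(
    catalog: str, resource_link: str, table: str, threshold: float = 0.5
) -> str:
    """Build a query counting variants whose AF attribute exceeds threshold."""
    return (
        f"SELECT count(*) FROM {qualified_table(catalog, resource_link, table)} "
        f"WHERE CAST(attributes['AF'] AS DOUBLE) > {float(threshold)}"
    )
```

The resulting string can be pasted into the SparkSQL shell (followed by a semicolon) or passed to `spark.sql(...)` in a PySpark session.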

Wrapping up

With the release of EMR 6.13.0, HealthOmics Analytics data is now available in your EMR cluster, unlocking more use cases and greater scale for genomic analytics. You should now be all set to run your HealthOmics Analytics Spark workloads on AWS. There are several ways to deploy EMR, such as on EC2, EKS, or Outposts. Check out the EMR pricing page to estimate the cost for your particular use case. There are no additional costs for querying your data in your HealthOmics Analytics Store. Have fun, and although your EMR cluster should auto-terminate, don’t forget to clean up your resources when you’re finished. On the EMR console, select omics-analytics-cluster (or whatever name you gave your cluster) and then choose Terminate.

Clusters table displayed on the EMR console filtered by the cluster name “omics-analytics-cluster”
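Cleanup can be scripted too. The sketch below finds active clusters by name and terminates them; it assumes boto3 with configured credentials, and the function names are hypothetical. It terminates every active cluster with a matching name, so check the name before running it.

```python
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def cluster_ids_named(clusters: list, name: str) -> list:
    """Return the IDs of the cluster summaries whose Name matches exactly."""
    return [cluster["Id"] for cluster in clusters if cluster.get("Name") == name]


def terminate_clusters_named(name: str) -> list:
    """Terminate all active clusters with the given name (calls AWS)."""
    import boto3  # imported lazily so the helper above works without boto3

    emr = boto3.client("emr")
    clusters = []
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        clusters.extend(page["Clusters"])
    cluster_ids = cluster_ids_named(clusters, name)
    if cluster_ids:
        emr.terminate_job_flows(JobFlowIds=cluster_ids)
    return cluster_ids
```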

Guy Hawkins

Guy is a tech lead across several AWS Healthcare and Life Science services including AWS HealthOmics. He has over 10 years of industry experience building and scaling distributed systems, ML/AI pipelines, high performance compute, and data lakes.