AWS for Industries
How to process medical text in multiple languages using Amazon Translate and Amazon Comprehend Medical
Amazon Comprehend Medical is a HIPAA-eligible service that uses deep learning to identify and extract relevant information from medical text. The service uses trained Natural Language Processing (NLP) models to identify medical entities and relationships, such as medication, dosage, diagnosis, and Protected Health Information (PHI). This provides an efficient and cost-effective way to mine data from large numbers of clinical reports, a process that is typically expensive and error-prone when done manually. Access to this data enables a number of use cases, including patient case management for better outcomes, clinical research, and medical billing and healthcare revenue cycle management. When data from clinical reports is combined with Electronic Medical Records (EMRs), providers gain a more complete view of the patient.
The initial release of Amazon Comprehend Medical detects medical entities in English-language text only. The solution in this blog post demonstrates how you can use Amazon Translate, in conjunction with Amazon Comprehend Medical, to process medical text in multiple languages. We create an endpoint using Amazon API Gateway that allows for simple processing of multiple records, and an Amazon Athena table for easy querying and visualization of the results. The solution is provided as an AWS CloudFormation template so that it can be easily deployed into your environment.
Solution overview
The architecture of the solution consists of the following components:
- Clinical notes to be processed by this solution need to be stored in a bucket on Amazon S3.
- Amazon API Gateway provides the endpoint for the solution. The URL for the endpoint can be found in the Outputs of the CloudFormation stack, and access to the endpoint is limited to the IP address range provided as an input to the CloudFormation template. The URL requires a bucket name as a query string parameter, and a prefix within that bucket can optionally be specified. The solution runs on all files present in that bucket/prefix.
- The API Gateway endpoint calls an AWS Lambda function, which performs the following steps on each file in the specified bucket/prefix (a minimal sketch of this flow follows this list):
  - Read the text from the S3 object.
  - Call Amazon Translate to convert the text into English. This step uses "auto" as the source language so that the language of the text is detected automatically; the detected source language is captured for the back-translation step.
  - Send the translated text to Amazon Comprehend Medical for entity detection.
  - Send the entities identified by Amazon Comprehend Medical back to Amazon Translate to be converted into the original language, using the source language detected earlier.
  - Write the translated entities to an output object stored in a results bucket in S3.
  - Generate a presigned URL with an expiration time of five minutes and provide it as output, in case you would like to access the results file directly.
- A separate S3 bucket is created by the CloudFormation template to capture the output from all runs of the solution. Output files are CSV formatted and include a timestamp in the name to identify when they were generated.
- An Athena database and table are created with the results bucket described above as the data source. This allows for ad hoc querying and visualization of the full set of results data from all runs of the solution.
- Amazon QuickSight can be used to visualize the results, with the Athena table as the data source.
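For reference, here is a minimal sketch of the per-file Lambda flow described above, written with boto3. The helper name, the CSV column layout, and the results key naming are illustrative assumptions; the actual implementation lives in the GitHub repository referenced in the prerequisites below.

import csv
import io
import boto3

s3 = boto3.client("s3")
translate = boto3.client("translate")
comprehend_medical = boto3.client("comprehendmedical")

def process_note(bucket, key, results_bucket, results_key):
    # Step 1: read the clinical note from S3.
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Step 2: translate into English; "auto" tells Amazon Translate to
    # detect the source language, which we capture for the back-translation.
    translated = translate.translate_text(
        Text=text, SourceLanguageCode="auto", TargetLanguageCode="en"
    )
    english_text = translated["TranslatedText"]
    source_language = translated["SourceLanguageCode"]

    # Step 3: detect medical entities in the English text.
    entities = comprehend_medical.detect_entities_v2(Text=english_text)["Entities"]

    # Step 4: translate each entity back into the original language and
    # build CSV rows (this column layout is an assumption of the sketch).
    output = io.StringIO()
    writer = csv.writer(output)
    writer.writerow(["file", "entity", "category", "type", "score", "language"])
    for entity in entities:
        entity_text = entity["Text"]
        if source_language != "en":
            entity_text = translate.translate_text(
                Text=entity_text,
                SourceLanguageCode="en",
                TargetLanguageCode=source_language,
            )["TranslatedText"]
        writer.writerow(
            [key, entity_text, entity["Category"], entity["Type"],
             entity["Score"], source_language]
        )

    # Step 5: write the results object and return a presigned URL that
    # expires in five minutes (300 seconds).
    s3.put_object(Bucket=results_bucket, Key=results_key, Body=output.getvalue())
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": results_bucket, "Key": results_key},
        ExpiresIn=300,
    )

In the deployed solution this logic runs once per file in the specified bucket/prefix, and the output file name includes a timestamp, as described above.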
Implementation steps
We use an AWS CloudFormation stack to deploy the solution. The CloudFormation stack creates the following required resources:
- Amazon S3 bucket for results files
- AWS Lambda function to perform the translation and analysis
- Amazon API Gateway API for calling the Lambda function
- AWS Lambda function to be used as a CloudFormation custom resource for creating the Amazon Athena database and table
- Amazon Athena database and table to query the results S3 bucket
- All necessary AWS Identity and Access Management (IAM) roles and policies
Prerequisites
Download the contents of the GitHub repository located here: https://github.com/aws-samples/amazon-translate-with-comprehend-medical. In the root directory, run the “makeZip.sh” script to create two zip files: translate-cm.zip and translate-cm-athena.zip. Place these two zip files and the translate_cm-cfn.yml CloudFormation template into an S3 bucket in your environment. Ensure that the user deploying the solution has access to the bucket containing these files.
Deployment
- Log in to the AWS Management Console with your IAM user name and password. Navigate to the CloudFormation service. Choose Create stack in the upper right corner, and select With new resources (standard).
- On the Create stack page, provide the S3 URL of the CloudFormation template object that was uploaded in the prerequisite steps, and choose Next in the bottom right.
- On the Specify stack details page, provide the following information and then choose Next:
- Provide a name for your stack in the Stack Name text box
- In the AllowedIPs text box, provide a CIDR block of IP addresses that should have access to the endpoint for the solution
- In the CodeBucket text box, provide the name of the bucket that the zip files were uploaded to in the prerequisites for this deployment
- On the Configure stack options page, leave all values as their defaults, and choose Next
- On the Review page, choose the checkbox next to the statement “I acknowledge that AWS CloudFormation might create IAM resources.” at the bottom of the page, and choose Create stack.
- Wait for the stack to reach a status of "CREATE_COMPLETE". You can look at the Events tab to track the status of each element of the stack as it is created, and the Resources tab provides the Physical IDs of the elements that were created. Once the stack is created, look at the Outputs tab, which contains the endpoint URL for the solution described in this blog post. If you prefer to script these steps rather than use the console, see the sketch below.
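The same deployment can be performed with boto3. This is a sketch under the assumption that the template and zip files were uploaded as described in the prerequisites; the stack name, bucket name, and CIDR block are placeholders to replace with your own values.

import boto3

cloudformation = boto3.client("cloudformation")

# Create the stack; CAPABILITY_IAM is the scripted equivalent of the
# console's "might create IAM resources" acknowledgment.
cloudformation.create_stack(
    StackName="translate-cm",  # placeholder stack name
    TemplateURL="https://example-code-bucket.s3.amazonaws.com/translate_cm-cfn.yml",  # placeholder
    Parameters=[
        {"ParameterKey": "AllowedIPs", "ParameterValue": "203.0.113.0/24"},  # your CIDR block
        {"ParameterKey": "CodeBucket", "ParameterValue": "example-code-bucket"},  # placeholder
    ],
    Capabilities=["CAPABILITY_IAM"],
)

# Wait for CREATE_COMPLETE, then read the endpoint URL from the Outputs.
cloudformation.get_waiter("stack_create_complete").wait(StackName="translate-cm")
stack = cloudformation.describe_stacks(StackName="translate-cm")["Stacks"][0]
for output in stack["Outputs"]:
    print(output["OutputKey"], output["OutputValue"])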
Run the solution and examine results
To run the solution, you need clinical notes stored in an S3 bucket that is accessible to you. Get the endpoint URL from the Outputs tab of the CloudFormation stack, and paste it into a web browser or use curl to send the API request. Provide the required query string parameter "bucket" to specify the location of the clinical files in S3, and the optional "prefix" parameter if the notes are stored under a lower-level folder. In the examples below, the endpoint, bucket, and prefix values are placeholders; substitute your own.
Browser:
https://<api-id>.execute-api.<region>.amazonaws.com/<stage>?bucket=<notes-bucket>&prefix=<optional-prefix>
curl:
curl "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>?bucket=<notes-bucket>&prefix=<optional-prefix>"
This command runs all notes in the specified bucket/prefix through the solution and writes a results file containing the entities identified by Comprehend Medical into the results bucket created by CloudFormation. This bucket is the data source for a database and table in Athena that can be used to query the results. Log into the AWS Management Console and navigate to Athena. Look for a database called "translatecm", which has a single table called "results". Select all entries from this table to see the results of the command above.
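You can also run the same query programmatically rather than in the Athena console. A minimal sketch with boto3, using the "translatecm" database and "results" table described above; the query results location is a placeholder bucket you must supply.

import time
import boto3

athena = boto3.client("athena")

# Start the query against the database and table created by the stack.
query_id = athena.start_query_execution(
    QueryString='SELECT * FROM "translatecm"."results" LIMIT 10',
    ResultConfiguration={"OutputLocation": "s3://example-query-results-bucket/"},  # placeholder
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print each row; the first row contains the column headers.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([column.get("VarCharValue") for column in row["Data"]])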
Visualize the results with Amazon QuickSight
Results generated by this solution can be visualized in Amazon QuickSight, using the Athena table created by the stack as the data source. Follow the steps below:
- Launch the AWS Management Console and navigate to QuickSight. Choose New analysis in the upper left corner.
- Choose New data set in the upper left corner to create a dataset for the results in the Athena table.
- Select Athena as the data source. Enter a name for this data source in the Data Source Name text box, and choose Validate connection to make sure you can connect to Athena. Choose Create data source.
- In the Choose your table dialog, select “translate_cm_results” from the “Database:” dropdown, and then select “results” in the “Tables” box. Choose Select.
- In the Finish data set creation dialog box, you have two choices:
- Import to SPICE for quicker analytics
- Directly query your data
SPICE is the in-memory optimized calculation engine for Amazon QuickSight, designed specifically for fast, ad hoc data visualization. Data in SPICE is not automatically updated, so while you will experience better performance than directly querying your data, you will need to manually refresh the data after every run of the solution. You can also schedule refreshes of SPICE Daily, Weekly, or Monthly with Standard or Enterprise QuickSight, or Hourly with Enterprise only. Selecting Directly query your data dynamically updates your visualizations with the most current data every time you open them. Make the selection that best fits your workflow, and choose Visualize.
- You can now use QuickSight to generate the visuals of your choice, such as word clouds, bar charts, and more. These can be saved as a dashboard, which can be dynamically updated after data is processed when directly querying data, or updated on a schedule when using SPICE.
Conclusion
In this blog post, you saw how you can use Amazon Translate, in conjunction with Amazon Comprehend Medical, to extract clinical entities from medical text in multiple languages, and then use Amazon Athena and Amazon QuickSight to visualize the results. This allows for the accurate and cost-effective mining of clinical data from large amounts of medical texts. This also provides a robust set of data that can be combined with other sources of patient information to provide a more comprehensive view of a patient to improve outcomes, and visualizations that allow for the quick identification of trends within the data.
This solution currently extracts only the entities, categories, and types from the medical text. It could easily be extended to capture attributes and traits, and to identify PHI, by adding the appropriate calls to Comprehend Medical, as sketched below. The solution presented in this blog processes batches of clinical text only when the endpoint is called by a user. It would be straightforward to add an S3 event that calls the Lambda function when a note is added to a specific bucket, creating an event-driven architecture for data processing.
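For example, the entity-detection step could be extended along the following lines. The detect_phi operation and the Attributes and Traits fields are part of the Comprehend Medical API; how the extra fields would be folded into the CSV output is an assumption left to the reader.

import boto3

comprehend_medical = boto3.client("comprehendmedical")
english_text = "..."  # the translated note from the existing Lambda flow

# Entities returned by detect_entities_v2 already carry nested attributes
# (e.g., the dosage attached to a medication) and traits (e.g., NEGATION);
# capture them instead of discarding them.
for entity in comprehend_medical.detect_entities_v2(Text=english_text)["Entities"]:
    attributes = [(a["Type"], a["Text"]) for a in entity.get("Attributes", [])]
    traits = [t["Name"] for t in entity.get("Traits", [])]
    print(entity["Text"], entity["Category"], attributes, traits)

# A dedicated operation for detecting Protected Health Information.
phi_entities = comprehend_medical.detect_phi(Text=english_text)["Entities"]
print([(e["Type"], e["Text"]) for e in phi_entities])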