AWS Machine Learning Blog
Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API
With an increasing number of digital text documents shared across the world for both business and personal reasons, the need for translation capabilities becomes even more critical. There are multiple tools available online that enable people to copy/paste text and get the translated equivalent in the language of their choice. While this is a great way to perform ad hoc translation of a (limited) amount of text, it can be tedious and time-consuming if performed frequently.
Your organization may largely depend on content to document your products and services, teach your customers how to interact with you, or just share the cool things you are doing. This content is often text-heavy and mostly written in English. This makes it hard for people without adequate knowledge of the language to understand it, which can directly impact the relationship with your customers. You need an automated solution that can translate a set of documents from one language to another quickly and cost-efficiently.
In this blog post, we walk through two different solutions to translate documents: a simple approach that translates a batch of documents asynchronously using asynchronous Batch Translation, and an advanced approach that translates documents synchronously as they arrive using AWS Lambda and Amazon Real-Time Translation. You can use the option that best fits your needs.
Simple approach using asynchronous Batch Translation
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Visit the Amazon Translate product page for more information.
Amazon Translate recently introduced asynchronous Batch Translation that enables you to translate a large collection of text or HTML documents. You can translate this collection of documents from one language to another with just a single API call. You can run the asynchronous Batch Translation daily to localize your documentation, teaching material, and blogs to the language of your choice. Additionally, you can monitor the progress of the batch translation job and retrieve its results from a specified output folder. Let us review how you can use asynchronous batch translation.
We use the three text files listed below to review Amazon Translate batch translation. You can find and download the text files from here.
Text file 1:
Text file 2:
Text file 3:
To create a batch translation job using this example text, complete the following steps:
- Create an S3 bucket in us-east-1 and provide a unique name, such as translate-job-batch-input.
- Create a folder inside the bucket and name it raw.
- Upload the text files that need to be translated to s3://translate-job-batch-input/raw/. This bucket will contain the input to Batch Translation.
- Create another S3 bucket in us-east-1 and provide a unique name, such as translate-job-batch-output.
- Create a folder inside the bucket and name it output.
- The output from the Batch Translation will be saved in s3://translate-job-batch-output/output/.
- On the Amazon Translate console, choose Batch Translation.
- Choose Create job.
- For Name, enter MyTranslationJob.
- For Source language, choose English.
- For Target language, choose German.
- For Input S3 location, enter s3://translate-job-batch-input/raw/.
- For File Format, choose txt.
- For Output S3 location, enter s3://translate-job-batch-output/output/.
- For Access permissions, select Create an IAM role.
- For IAM role, select Input and output S3 buckets.
- For Role name, enter translate-batch-role.
- Choose Create job.
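The console steps above can also be expressed programmatically. The following is a minimal sketch, not the method used by this post, that assembles the equivalent request for the StartTextTranslationJob API and submits it with boto3. The function names are hypothetical, and the role ARN format is an assumption: a role created through the console may carry a different path or name prefix, so copy the actual ARN from the IAM console.

```python
def build_batch_job_request(account_id: str) -> dict:
    """Assemble keyword arguments mirroring the console settings above."""
    return {
        "JobName": "MyTranslationJob",
        "InputDataConfig": {
            "S3Uri": "s3://translate-job-batch-input/raw/",
            "ContentType": "text/plain",  # corresponds to the txt file format
        },
        "OutputDataConfig": {"S3Uri": "s3://translate-job-batch-output/output/"},
        # Assumption: the real ARN of translate-batch-role may differ;
        # look it up in IAM rather than relying on this pattern.
        "DataAccessRoleArn": f"arn:aws:iam::{account_id}:role/translate-batch-role",
        "SourceLanguageCode": "en",
        "TargetLanguageCodes": ["de"],
    }


def start_batch_job(account_id: str, region: str = "us-east-1") -> str:
    """Submit the job and return its JobId (requires boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("translate", region_name=region)
    response = client.start_text_translation_job(
        **build_batch_job_request(account_id)
    )
    return response["JobId"]
```

Keeping the request in a plain dictionary makes it easy to review or log the settings before any call is made.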
After you create the job, you can see the job in progress on the Amazon Translate console. See the following screenshot.
When the job is finished, you see the status Completed and your translated documents appear in your output S3 bucket. The following screenshot shows the details of your completed job.
The translated files are stored in the output S3 location as shown below.
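If you start jobs from code rather than the console, you can monitor progress by polling the job status. The sketch below is one possible approach under stated assumptions: wait_for_job is a hypothetical helper, and it accepts any zero-argument callable that returns a response shaped like the DescribeTextTranslationJob output, so it can be exercised without AWS access.

```python
import time
from typing import Callable

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(describe: Callable[[], dict],
                 poll_seconds: float = 30.0,
                 max_polls: int = 120) -> str:
    """Poll until the job reaches a terminal status, then return that status."""
    for _ in range(max_polls):
        status = describe()["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("translation job did not finish within the polling budget")
```

With boto3, the callable could be `lambda: client.describe_text_translation_job(JobId=job_id)`, where `client` is a Translate client and `job_id` is the identifier returned when the job was started.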
Advanced approach using AWS Lambda and Real-Time Translation
You can solve your content localization problem easily and cost-efficiently by running a batch translation job. However, there are situations where you do not have the time to accumulate a batch of documents and call the asynchronous batch API periodically. In such cases, you need translation to start as soon as a document is ready.
To achieve this goal, we use an event-driven architecture. We configure a specific S3 bucket to send a notification to AWS Lambda whenever a new document is uploaded to it. This notification triggers AWS Lambda to run code that performs the following sequence of events: read the document uploaded to the S3 bucket, extract short segments from the document that can be passed through the Real-Time Translation API, pass these segments through the Real-Time Translation API, use the API's output to rebuild the translated output document, and save the output in the specified output location.
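The segment-and-rebuild step can be sketched as follows. This is an illustrative sketch, not the code deployed by the stack: the function names are hypothetical, and the 5,000-byte default is an assumed conservative limit, so check the current Amazon Translate quota for real-time requests before relying on it.

```python
from typing import Callable, List

MAX_SEGMENT_BYTES = 5000  # assumption: conservative per-request size limit


def split_into_segments(text: str, max_bytes: int = MAX_SEGMENT_BYTES) -> List[str]:
    """Greedily pack whole lines into segments of at most max_bytes (UTF-8)."""
    segments: List[str] = []
    current: List[str] = []
    current_size = 0
    for line in text.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single line exceeds the segment size limit")
        if current and current_size + size > max_bytes:
            segments.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        segments.append("".join(current))
    return segments


def translate_document(text: str, translate_segment: Callable[[str], str]) -> str:
    """Translate each segment in order and concatenate the results."""
    return "".join(translate_segment(s) for s in split_into_segments(text))
```

With boto3, `translate_segment` could wrap `client.translate_text(Text=segment, SourceLanguageCode="en", TargetLanguageCode="de")["TranslatedText"]`. Splitting on line boundaries keeps sentences intact in most plain-text documents, although a single very long line would need a finer-grained splitter.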
The following diagram illustrates this architecture.
Launching the application using AWS CloudFormation
You can implement this solution easily and cost-effectively in your AWS account by launching the following AWS CloudFormation stack in the AWS CloudFormation console:
Deploying the application
To deploy the application, complete the following steps:
- On the AWS CloudFormation console, choose Create stack with new resources (standard).
- Select Amazon S3 URL, paste https://s3.amazonaws.com/aws-ml-blog/artifacts/serverless-document-translation/translate-lambda-cfn-stack.yml in the Amazon S3 URL field, and choose Next.
- For Stack name, enter a unique stack name for this account; for example, automated-document-translation.
- For IAMRoleName, enter a unique IAM role name for this account; for example, TranslationLambdaExecRole. The Lambda function assumes this role for accessing the required Amazon S3 and Amazon Translate APIs. This IAM role has two policies attached to it: a custom policy giving read/write permissions (GetObject and PutObject) on the input and output S3 buckets, and a TranslateReadOnly policy managed by AWS to make API calls to Amazon Translate.
- For LambdaFunctionName, enter a unique AWS Lambda function name; for example, trigger-translation.
- For InputBucketName, enter a unique name for the Amazon S3 bucket the stack creates; for example, raw-input-bucket. The input documents are uploaded to this bucket before they are translated. Use only lower-case characters and no spaces when you create the name of the input bucket. Furthermore, this operation creates a new S3 bucket, so do not use the name of an existing bucket. For more information, see Rules for Bucket Naming.
- For OutputBucketName, enter a unique name for your output bucket; for example, translated-output-bucket. This bucket stores the output documents after they are translated. Follow the same naming rules as your input bucket.
- For SourceLanguageCode, enter the language code of your source documents; for example, en for English, or auto to detect the dominant language.
- For TargetLanguageCode, enter the language code that you want your translated documents in; for example, de for German. For more information about supported language codes, see What is Amazon Translate?
- Choose Next.
- On the Configure Stack Options page, choose any additional optional parameters, including tags for your stack.
- Choose Next.
- Select the check box I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create Stack.
It takes up to a minute for the stack creation to complete.
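If you prefer to script the deployment, the same stack can be created through the CloudFormation API. The sketch below assumes the template accepts the parameter names listed in the steps above; the helper names are hypothetical. CAPABILITY_NAMED_IAM corresponds to the acknowledgement check box, because the stack creates an IAM role with a custom name.

```python
TEMPLATE_URL = (
    "https://s3.amazonaws.com/aws-ml-blog/artifacts/"
    "serverless-document-translation/translate-lambda-cfn-stack.yml"
)


def build_stack_request(stack_name: str, parameters: dict) -> dict:
    """Convert a plain dict of parameters into a CreateStack request."""
    return {
        "StackName": stack_name,
        "TemplateURL": TEMPLATE_URL,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(stack_name: str, parameters: dict) -> None:
    """Create the stack and block until creation completes (requires boto3)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("cloudformation")
    client.create_stack(**build_stack_request(stack_name, parameters))
    client.get_waiter("stack_create_complete").wait(StackName=stack_name)
```

For example, `deploy("automated-document-translation", {"IAMRoleName": "TranslationLambdaExecRole", "LambdaFunctionName": "trigger-translation", "InputBucketName": "raw-input-bucket", "OutputBucketName": "translated-output-bucket", "SourceLanguageCode": "en", "TargetLanguageCode": "de"})` mirrors the example values used in the console walkthrough.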
Using the application
After you create the AWS CloudFormation stack, you are ready to start using the solution.
You can upload a text document to translate into the input S3 bucket. This starts the workflow, and the translated document automatically appears in the output S3 bucket when complete.
Translated documents are stored in the output S3 bucket at the following path: <TargetLanguageCode>/<original path of the source file>. For example, if an input document titled FinalProposal.txt was stored in an S3 folder named Marketing inside the input S3 bucket, then its translated document in German will be stored at de/Marketing/FinalProposal.txt inside the output bucket.
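The path mapping described above can be sketched in a few lines. This is an illustration of the naming rule rather than the deployed function's code, and both helper names are hypothetical. Object keys arrive URL-encoded in S3 event notifications, so the sketch decodes them before building the output key.

```python
from typing import Tuple
from urllib.parse import unquote_plus


def source_location(event: dict) -> Tuple[str, str]:
    """Extract the bucket name and decoded object key from an S3 event."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def output_key(target_language_code: str, source_key: str) -> str:
    """Prefix the original key with the target language code."""
    return f"{target_language_code}/{source_key}"
```

For the example in the text, `output_key("de", "Marketing/FinalProposal.txt")` yields `de/Marketing/FinalProposal.txt`.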
If you do not see the document in the output S3 bucket, check the Amazon CloudWatch Logs for the corresponding Lambda function and look for potential errors that caused the failure.
This solution only handles UTF-8 encoded text documents. You can modify the Python code in this post to handle different file formats. Because each document is translated within a single Lambda invocation, the solution is also limited by the maximum execution time (timeout) for a Lambda function.
Conclusion
In this post, we show the implementation of two different solutions to translate documents using Amazon Translate – a simple approach using asynchronous Batch Translation and an advanced approach using AWS Lambda and Amazon Real-Time Translation. Come build your first translation job on Amazon Translate.
About the authors
Jay Rao is a Solutions Architect at AWS. He enjoys providing technical guidance to customers and helping them design and implement solutions on AWS.
Nikiforos Botis is a Solutions Architect at AWS. He enjoys helping his customers succeed in their cloud journey, especially when it comes to Machine Learning. Outside work, he loves traveling and playing basketball.
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.