AWS Machine Learning Blog

Customize Amazon Translate output to meet your domain and organization specific vocabulary

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. When you translate from one language to another, you want your machine translation to be accurate, fluent, and most importantly contextual. Customization is key in keeping your machine translation contextual. Amazon Translate provides multiple capabilities for customization to achieve the best machine translation. One such capability is custom terminology. Custom terminology enables you to customize your translation output such that your domain and organization specific vocabulary such as brand names, character names, model names, and other unique content (named entities) are translated exactly the way you need. To use the custom terminology feature, you create a terminology using a terminology file in a CSV or TMX file format and specify this custom terminology as a parameter in an Amazon Translate real-time translation or asynchronous batch processing request.

Amazon Translate now supports multi-directional custom terminology. You no longer have to create multiple terminology CSV files with each one differing only in the first column to indicate the source language, include additional preprocessing logic to identify the dominant language, and choose the correct terminology file for the translation request. You can now use a single custom terminology for multiple source and target language combinations. Even when you set the source language to be detected automatically, Amazon Translate uses Amazon Comprehend to determine the dominant language of the source material, uses it as the source language, and translates the text using the terms specified in the custom terminology. For additional details on custom terminology, refer to Customizing Your Translations with Custom Terminology.
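To make the difference concrete, the following sketch reconstructs the per-source-language files that the earlier approach required from a single multi-directional file. It's illustrative only: the helper name unidirectional_variants and the three-language sample are assumptions for this example, and the code doesn't call Amazon Translate.

```python
import csv
import io


def unidirectional_variants(multi_csv_text):
    """Return {source_lang: csv_text}, one variant per language column.

    Each variant contains the same rows with that language moved to the
    first column, which is how a uni-directional file marks its source.
    """
    rows = list(csv.reader(io.StringIO(multi_csv_text)))
    header = rows[0]
    variants = {}
    for idx, lang in enumerate(header):
        # Put column `idx` first; keep the remaining columns in order.
        order = [idx] + [i for i in range(len(header)) if i != idx]
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        for row in rows:
            writer.writerow([row[i] for i in order])
        variants[lang] = out.getvalue()
    return variants


# Hypothetical three-language sample
multi = "en,es,fr\nAmazon,Amazon,Amazon\nAlexa,Alexa,Alexa\n"
variants = unidirectional_variants(multi)
print(len(variants), "files were needed before; 1 is needed now")
```

With multi-directional support, only the original single file has to be maintained and imported.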

In this post, we walk you through how to create a multi-directional custom terminology and use it to get customized machine translation output securely.

Solution overview

To customize your translation for terms that are unique to your industry domain or organization, you define these terms in a terminology file in CSV or TMX file format. Terms in a custom terminology are case-sensitive: Amazon Translate applies a terminology entry only when it finds an exact match, including case, in the source text.

For our use case, the data is in CSV format in a file named custom_terminology.csv. The file should be UTF-8 encoded. The following table summarizes the contents of the file.

en      es      fr      hi      ta
Echo    Echo    Echo    Echo    Echo
Show    Show    Show    Show    Show
Amazon  Amazon  Amazon  Amazon  Amazon
Alexa   Alexa   Alexa   Alexa   Alexa
AZ2     AZ2     AZ2     AZ2     AZ2
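If you prefer to generate the file rather than create it by hand, the following is a minimal sketch that writes the table above as a UTF-8 encoded CSV file. The function name write_terminology_csv is an assumption for this example. Every term maps to itself in each language because we want these names left untranslated.

```python
import csv

LANGS = ["en", "es", "fr", "hi", "ta"]
TERMS = ["Echo", "Show", "Amazon", "Alexa", "AZ2"]


def write_terminology_csv(path, langs, terms):
    """Write a header row of language codes, then one row per term."""
    # encoding="utf-8" matters once terms contain non-ASCII characters;
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(langs)
        for term in terms:
            # Same spelling in every column keeps the term unchanged.
            writer.writerow([term] * len(langs))


write_terminology_csv("custom_terminology.csv", LANGS, TERMS)
```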

Import terminology

First, we import our multi-directional custom terminology using the custom_terminology.csv file. In the following sections, we show you how to import your terminology using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the Amazon Translate SDK (Python Boto3).

Amazon Translate console

To import the terminology via the console, complete the following steps:

  1. On the Amazon Translate console, in the navigation pane, choose Custom terminology.
  2. Choose Create terminology.

  3. For Name, enter an appropriate name, for example CustomTerminologyDemo.
  4. For Terminology file, upload the custom_terminology.csv file.
  5. For Terminology file data format, choose CSV, because we uploaded a CSV file.
  6. For Directionality, choose Multi-directional.
  7. For Encryption key, keep the default for this post: an AWS owned and managed key. You can select any appropriate key.

Your data is always secure with Amazon Translate. By default, it’s encrypted with an AWS owned key through AWS Key Management Service (AWS KMS). You can instead encrypt it with a key from your current account or a key from a different account.

  8. Choose Create terminology.

Your custom terminology is now listed on the Custom terminology page.
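If you choose your own key instead of the default, the equivalent SDK request carries an EncryptionKey parameter. The following is a minimal sketch that only builds the request parameters; the helper name build_import_request is an assumption for this example, and any key ARN you pass in is your own.

```python
def build_import_request(name, csv_bytes, kms_key_id=None, description=""):
    """Build keyword arguments for the Boto3 import_terminology call."""
    request = {
        "Name": name,
        "MergeStrategy": "OVERWRITE",
        "TerminologyData": {
            "File": csv_bytes,
            "Format": "CSV",
            "Directionality": "MULTI",
        },
    }
    if description:
        request["Description"] = description
    if kms_key_id:
        # Omit EncryptionKey entirely to keep the default AWS owned key.
        request["EncryptionKey"] = {"Type": "KMS", "Id": kms_key_id}
    return request
```

You can then pass the result to the client, for example boto3.client('translate').import_terminology(**request). Keeping the request construction separate from the call makes it easy to inspect before anything is sent.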

AWS CLI

The following AWS CLI commands are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

You can call the import-terminology AWS CLI command to create a custom terminology resource:

aws translate import-terminology \
--region us-east-1 \
--name CustomTerminologyDemo \
--description "Multi-Directional custom terminology in AWS Translate" \
--merge-strategy OVERWRITE \
--data-file fileb://custom_terminology.csv \
--terminology-data Format=CSV,Directionality=MULTI

You get a response like the following snippet:

{
    "TerminologyProperties": {
        "Name": "CustomTerminologyDemo",
        "Description": "Multi-Directional custom terminology in AWS Translate",
        "Arn": "arn:aws:translate:us-east-1:123456789012:terminology/CustomTerminologyDemo/LATEST",
        "Directionality": "MULTI",
        "SourceLanguageCode": "en",
        "TargetLanguageCodes": [
            "hi",
            "fr",
            "ta",
            "es"
        ],
        "SizeBytes": 136,
        "TermCount": 20, 
        "CreatedAt": "2021-10-12T15:29:51.294000-04:00",
        "LastUpdatedAt": "2021-10-12T15:29:51.458000-04:00"
    }
}

You can use the list-terminologies command to list all the custom terminologies you have created:

aws translate list-terminologies --region us-east-1

The response looks like the following:

{
    "TerminologyPropertiesList": [
        {
            "Name": "CustomTerminologyDemo",
            "Arn": "arn:aws:translate:us-east-1:123456789012:terminology/CustomTerminologyDemo/LATEST",
            "SourceLanguageCode": "en",
            "TargetLanguageCodes": [
                "hi",
                "ta",
                "fr",
                "es"
            ],
            "SizeBytes": 157,
            "TermCount": 20,
            "CreatedAt": "2021-10-12T15:29:51.294000-04:00",
            "LastUpdatedAt": "2021-10-12T15:29:51.458000-04:00",
            "Directionality": "MULTI",
            "Format": "CSV"
        }
    ]
}
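When an account holds many terminologies, the list can come back in pages. The following is a minimal sketch that follows the NextToken value until every page has been read. The function name and the page size of 50 are assumptions for this example; client is expected to be a Boto3 Translate client.

```python
def list_all_terminology_names(client, page_size=50):
    """Collect terminology names across all pages of list_terminologies."""
    names = []
    kwargs = {"MaxResults": page_size}
    while True:
        page = client.list_terminologies(**kwargs)
        names.extend(
            props["Name"] for props in page.get("TerminologyPropertiesList", [])
        )
        token = page.get("NextToken")
        if not token:
            # No token means this was the last page.
            return names
        kwargs["NextToken"] = token
```

For example, call list_all_terminology_names(boto3.client('translate')).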

You can use the get-terminology command to get the details of a specific custom terminology resource:

aws translate get-terminology --name CustomTerminologyDemo --terminology-data-format CSV --region us-east-1

The response looks like the following:

{
    "TerminologyProperties": {
        "Name": "CustomTerminologyDemo",
        "Description": "Multi-Directional custom terminology in AWS Translate",
        "Arn": "arn:aws:translate:us-east-1:123456789012:terminology/CustomTerminologyDemo/LATEST",
        "Format": "CSV",
        "Directionality": "MULTI",
        "SourceLanguageCode": "en",
        "TargetLanguageCodes": [
            "hi",
            "fr",
            "ta",
            "es"
        ],
        "SizeBytes": 136,
        "TermCount": 20, 
        "CreatedAt": "2021-10-12T15:29:51.294000-04:00",
        "LastUpdatedAt": "2021-10-12T15:29:51.458000-04:00"
    },
    "TerminologyDataLocation": {
        "RepositoryType": "S3",
        "Location": "https://aws-translate-terminology-prod-us-east-1.s3.us-east-1.amazonaws.com/123456789012/CustomTerminologyDemo/LATEST/c5c307b8-30f3-4704-8e39-ca4e9330ff6f/CSV/Custom_terminology.csv?X-Amz-Security-Token=1111222233334444aaaaeeeefffff&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211022T150354Z&X-Amz-SignedHeaders=host&X-Amz-Expires=1800&X-Amz-Credential=ASIA1a2b3c4d5e6f%2F20211022%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=aaaabbbb11112222"
    }
}
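The Location value in this response is a presigned Amazon S3 URL, so it works only for a limited time. In the sample above, X-Amz-Expires is 1800 seconds. The following sketch, with a function name we chose for this example, reads the signing time and lifetime from the URL's query string to work out when the link stops working.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the UTC time at which a SigV4 presigned URL expires."""
    query = parse_qs(urlparse(url).query)
    # X-Amz-Date is the signing time, for example 20211022T150354Z.
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the lifetime in seconds from the signing time.
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))
```

To download the file while the link is valid, you can pass the URL to urllib.request.urlopen and read the response body.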

To delete a custom terminology resource, you can use the delete-terminology command:

aws translate delete-terminology --name CustomTerminologyDemo --region us-east-1

Amazon Translate SDK (Python Boto3)

The following Python 3 code creates a custom terminology, lists all the custom terminologies, retrieves the details of the new terminology, and uses it as part of a real-time translation call:

import boto3

# Create an Amazon Translate client (uses your default Region and credentials)
translate = boto3.client('translate')

# Import the CSV file as a multi-directional custom terminology
with open('custom_terminology.csv', 'rb') as ct_file:
    translate.import_terminology(
        Name='CustomTerminology_boto3',
        MergeStrategy='OVERWRITE',
        Description='Terminology for Demo through boto3',
        TerminologyData={
            'File': ct_file.read(),
            'Format': 'CSV',
            'Directionality': 'MULTI'
        }
    )

# List the names of all custom terminologies
response = translate.list_terminologies()
terminology_names = [props["Name"] for props in response["TerminologyPropertiesList"]]
print(terminology_names)

# Get the details of the terminology we just imported
response = translate.get_terminology(
    Name='CustomTerminology_boto3',
    TerminologyDataFormat='CSV'
)
print("Name: {}".format(response["TerminologyProperties"]["Name"]))
print("Description: {}".format(response["TerminologyProperties"]["Description"]))
print("ARN: {}".format(response["TerminologyProperties"]["Arn"]))
print("Directionality: {}".format(response["TerminologyProperties"]["Directionality"]))

SOURCE_TEXT = ("Amazon a présenté aujourd'hui Echo Show 15, un nouvel ajout à la famille Echo Show qui est conçu pour être le cœur numérique de votre maison")

OUTPUT_LANG_CODE = 'en'

# Translate with automatic source language detection and the custom terminology
result = translate.translate_text(
    Text=SOURCE_TEXT,
    TerminologyNames=['CustomTerminology_boto3'],
    SourceLanguageCode='auto',
    TargetLanguageCode=OUTPUT_LANG_CODE
)

print("Translated Text: {}".format(result['TranslatedText']))

Running the Python code prints the following result:

python translate_custom_terminology.py
 
['CustomTerminology_boto3']
Name: CustomTerminology_boto3
Description: Terminology for Demo through boto3
ARN: arn:aws:translate:us-east-1:123456789012:terminology/CustomTerminology_boto3/LATEST
Directionality: MULTI

Translated Text: Amazon today introduced Echo Show 15, a new addition to the Echo Show family that is designed to be the digital heart of your home.
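Besides the translated text, the translate_text response also reports which terminology entries were used, in its AppliedTerminologies field. The following is a minimal sketch that flattens that field into simple tuples; the helper name applied_terms is an assumption for this example.

```python
def applied_terms(response):
    """Return (terminology_name, source_text, target_text) tuples."""
    pairs = []
    for terminology in response.get("AppliedTerminologies", []):
        for term in terminology.get("Terms", []):
            pairs.append(
                (terminology["Name"], term["SourceText"], term["TargetText"])
            )
    return pairs
```

Calling applied_terms(result) on the result from the preceding script is a quick way to confirm that your terms were matched. When the source language is set to auto, the SourceLanguageCode field of the same response shows the language that was detected.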

Real-time translation using multi-directional custom terminology

In this section, we demonstrate two use cases using multi-directional custom terminology for real-time translation in Amazon Translate.

Scenario 1: Multi-directional custom terminology

For a basic demonstration of using multi-directional custom terminology with real-time translation, we use the following sample text in Spanish to be translated to French.

Amazon ha presentado hoy el Echo Show 15, una nueva incorporación a la familia Echo Show que está diseñada para ser el corazón digital de tu hogar. Con una pantalla Full HD de 15,6 pulgadas y 1080p, el Echo Show 15 puede fijarse en la pared o colocarse sobre un soporte compatible, ya sea en orientación vertical u horizontal, y está diseñado para ayudarte a mantenerte organizado, conectado y entretenido. El Echo Show 15 está fabricado con el procesador Amazon AZ2 Neural Edge de última generación, una pantalla de inicio rediseñada con más opciones de personalización, nuevas funcionalidades de personalización con ID Visual, y experiencias de Alexa totalmente nuevas.

On the Amazon Translate console, complete the following steps:

  1. Choose Spanish (es) as the Source language.
  2. Choose French (fr) as the Target language.
  3. In the Additional settings section, turn on Custom terminology.
  4. Choose CustomTerminologyDemo as the terminology.
  5. Enter the provided sample text in the Source language text area.

The following screenshot shows the translated text with custom terminology applied.

Spanish wasn’t the first column in the terminology file we uploaded, but with multi-directional terminology support, Amazon Translate was able to use the supplied terminology file to customize the translation.
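You can make the same request programmatically. The following is a minimal sketch, assuming you pass in a Boto3 Translate client; the function name translate_with_terminology is ours. Note that the source language is named explicitly and doesn't have to be the first column of the terminology file.

```python
def translate_with_terminology(client, text, source_lang, target_lang,
                               terminology_name):
    """Translate text in real time with one custom terminology applied."""
    response = client.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
        TerminologyNames=[terminology_name],
    )
    return response["TranslatedText"]
```

For this scenario, the call would be translate_with_terminology(boto3.client('translate'), sample_text, 'es', 'fr', 'CustomTerminologyDemo'), where sample_text holds the Spanish text above.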

Scenario 2: Automatically detect source language

In this use case, we demonstrate the capability in Amazon Translate to automatically detect the source language and use the supplied terminology file to customize the translation. We use the following sample text in French and translate it to Hindi:

Aujourd’hui, Amazon présente Echo Show 15, dernier-né de la gamme Echo Show, imaginé pour être le cœur numérique de votre domicile. Avec un écran Full HD 1080p de 15,6’’, Echo Show 15 peut être fixé au mur ou posé sur un support compatible, en orientation portrait ou paysage, et est conçu pour vous aider à rester organisé·e, connecté·e et diverti·e. Echo Show 15 est équipé du processeur Amazon AZ2 Neural Edge de nouvelle génération, d’un écran d’accueil repensé avec plus d’options et de nouvelles fonctionnalités de personnalisation grâce à l’identifiant facial, et bénéficie de toutes nouvelles expériences Alexa.

First, let’s demonstrate the translation without custom terminology.

  1. Choose Auto (auto) as the Source language.
  2. Choose Hindi (hi) as the Target language.
  3. Enter the provided text in the Source language text area.

The following screenshot shows the translated text.

Words like Amazon, Echo, Show, AZ2, and Alexa have been transliterated into Devanagari script.

Let’s perform the same translation using our multi-directional custom terminology.

  1. Choose Auto (auto) as the Source language.
  2. Choose Hindi (hi) as the Target language.
  3. In the Additional settings section, turn on Custom terminology.
  4. Choose CustomTerminologyDemo as the terminology.
  5. Enter the provided text in the Source language text area.

The following screenshot shows the translated text with custom terminology applied.

The source language was automatically detected as French, and with the multi-directional custom terminology support, Amazon Translate was able to use the supplied terminology file to customize the translation and retain the Latin script for words like Amazon, Echo, Show, AZ2, and Alexa.
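If you run this comparison in code rather than on the console, you can check the result automatically. The following is a minimal sketch of such a check; the function name missing_terms is an assumption for this example. It uses a plain case-sensitive substring test, in line with the case-sensitive matching of terminology entries.

```python
PROTECTED_TERMS = ["Amazon", "Echo", "Show", "AZ2", "Alexa"]


def missing_terms(source_text, translated_text, terms=PROTECTED_TERMS):
    """Return terms found in the source but not verbatim in the translation."""
    return [
        term for term in terms
        if term in source_text and term not in translated_text
    ]
```

An empty list means every protected term that appeared in the source text survived unchanged in the output. A non-empty list names the terms that were altered, which is what you would expect to see for the translation made without the custom terminology.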

Conclusion

When you use custom terminology with translation requests, you can make sure that your unique content, such as brand names, character names, and model names, is translated exactly the way you need it, regardless of context and the Amazon Translate algorithm’s decision. In addition, multi-directional custom terminology drastically reduces the overhead of maintaining multiple terminologies, because you can use a single terminology to translate to and from any language it contains. For more information about how to get the best translation quality when using custom terminology, see Best Practices.


About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoor activities and watching documentaries.

Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, machine learning, and security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.

Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.