Amazon Transcribe – Accurate Speech To Text At Scale

Update (August 31, 2021) – Removed outdated S3 URLs in the console screenshot and the code.

Today we’re launching a private preview of Amazon Transcribe, an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capabilities to their applications. As bandwidth and connectivity improve, more and more of the world’s data is stored in video and audio formats. People are creating and consuming all of this data faster than ever before. It’s important for businesses to have some means of deriving value from all of that rich multimedia content. With Amazon Transcribe you can replace the costly process of manual transcription with an efficient and scalable API.

You can analyze audio files stored in Amazon Simple Storage Service (Amazon S3) in many common formats (WAV, MP3, FLAC, etc.) by starting a job with the API. You’ll receive detailed and accurate transcriptions with timestamps for each word, as well as inferred punctuation. During the preview you can use the asynchronous transcription API to transcribe speech in English or Spanish.

Companies are looking to derive value from both their existing catalogs and their incoming data. By transcribing these stored media, companies can:

Analyze customer call data
Automate subtitle creation
Target advertising based on content
Enable rich search capabilities on archives of audio and video content

You can start a transcription job easily with the AWS Command Line Interface (AWS CLI), AWS SDKs, or the Amazon Transcribe console.

Amazon Transcribe currently has three mostly self-explanatory API actions:

StartTranscriptionJob
GetTranscriptionJob
ListTranscriptionJobs

Here’s a quick Python script that starts a job and polls until the job is finished:

from __future__ import print_function
import time
import boto3

transcribe = boto3.client('transcribe')
job_name = "RandallTest1"
job_uri = "https://s3-us-west-2.amazonaws.com/<Your-Bucket>/test.flac"

# Start an asynchronous transcription job for the FLAC file in S3.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='flac',
    LanguageCode='en-US',
    MediaSampleRateHertz=44100
)

# Poll every five seconds until the job reaches a terminal state.
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)
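
The loop above keeps polling until the job reaches a terminal state, however long that takes. As a minimal sketch, the same polling can be wrapped in a function that gives up after a deadline. The helper name `wait_for_job`, its parameters, and the default timings are assumptions made for illustration and are not part of the Transcribe API; only `get_transcription_job` and the response keys already used in the script come from the service. The sketch targets Python 3.

```python
import time


def wait_for_job(client, job_name, poll_seconds=5, timeout_seconds=600):
    """Poll a transcription job until it completes, fails, or times out.

    `client` is expected to behave like boto3.client('transcribe').
    The helper name and its parameters are hypothetical, chosen for this sketch.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response['TranscriptionJob']
        # COMPLETED and FAILED are the two terminal states checked in the script above.
        if job['TranscriptionJobStatus'] in ('COMPLETED', 'FAILED'):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Job %s still running after %s seconds" % (job_name, timeout_seconds)
            )
        time.sleep(poll_seconds)
```

With the client from the script, this would be called as `wait_for_job(transcribe, job_name)`. For a completed job, the returned dictionary should carry the transcript location under `job['Transcript']['TranscriptFileUri']`.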

The result of a completed job links to an Amazon S3 presigned URL that contains our transcription in JSON format:

{
  "jobName": "RandallTest1",
  "results": {
    "transcripts": [{"transcript": "Hello World", "confidence": 1}],
    "items": [
      {
        "start_time": "0.880", "end_time": "1.300",
        "alternatives": [{"confidence": 0.91, "word": "Hello"}]
      },
      {
        "start_time": "1.400", "end_time": "1.620",
        "alternatives": [{"confidence": 0.84, "word": "World"}]
      }
    ]
  },
  "status": "COMPLETED"
}

As you can see, you get timestamps and confidence scores for each word.
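
To make the per-word data concrete, here is a small sketch that reads a transcript with the layout shown above and flags words the service was less sure about. It assumes every item carries `start_time`, `end_time`, and at least one alternative with `word` and `confidence`, exactly as in the sample; real output may differ from that. The function names and the 0.9 threshold are illustrative choices, not anything defined by Amazon Transcribe.

```python
import json

# The sample transcript from this post, embedded so the sketch is self-contained.
SAMPLE = """
{
  "jobName": "RandallTest1",
  "results": {
    "transcripts": [{"transcript": "Hello World", "confidence": 1}],
    "items": [
      {"start_time": "0.880", "end_time": "1.300",
       "alternatives": [{"confidence": 0.91, "word": "Hello"}]},
      {"start_time": "1.400", "end_time": "1.620",
       "alternatives": [{"confidence": 0.84, "word": "World"}]}
    ]
  },
  "status": "COMPLETED"
}
"""


def words_with_timing(transcript_json):
    """Return one dict per item using its first listed alternative (hypothetical helper)."""
    doc = json.loads(transcript_json)
    words = []
    for item in doc['results']['items']:
        best = item['alternatives'][0]
        words.append({
            'word': best['word'],
            # Timestamps arrive as strings, so convert them to seconds as floats.
            'start': float(item['start_time']),
            'end': float(item['end_time']),
            'confidence': float(best['confidence']),
        })
    return words


def low_confidence(words, threshold=0.9):
    """List the words whose confidence falls below the threshold (assumed cutoff)."""
    return [w['word'] for w in words if w['confidence'] < threshold]
```

Calling `low_confidence(words_with_timing(SAMPLE))` on the sample picks out "World", since its score of 0.84 is below the assumed cutoff. A list like this could feed a manual review step before subtitles or search indexes are built.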

Whether used alone or combined with other Amazon AI services, this is a powerful service, and I can’t wait to see what our customers build with it! Sign up for the preview today.

– Randall

P.S.
You might have noticed this lends itself well to AWS Step Functions and I thought the same. Here’s a workflow I might use:
