Generate synchronized closed captions and audio using the Amazon Polly subtitle generator

Amazon Polly, an AI generated text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.

As our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.

Although subtitles and captions are often used interchangeably, including in this post, there are subtle differences among them:

Subtitles – In subtitles, text language displayed on the screen is different from the audio language and doesn’t display anything for non-dialogue like significant sounds. The primary objective is to reach the audience that doesn’t speak the audio language in the video.
Captions (closed/open) – Captions display the dialogues being spoken in the audio in the same language. Its primary purpose is to increase accessibility in cases where the audio can’t be heard by the end consumer due to a range of issues. Closed captions are part of a different file than the audio/video source and can be turned off and on at the user’s discretion, whereas open captions are part of the video file and can’t be turned off by the user.

Benefits of using Amazon Polly to generate audio with subtitles or closed captions

Imagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when Amazon Polly generates your audio, captions are included that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.

There are a multitude of use cases for captions, such as advertisements in social spaces, gymnasiums, coffee shops, and other places where typically there is something on a television with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing co-passengers; and several more.

Irrespective of the field of application, closed captioning can help with the following:

Accessibility – People with hearing impairments can better consume your content.
Retention – Online learning is easier for e-learners to grasp and retain when more human senses are involved.
Reachability – Your content can reach people that have competing priorities, such as gaming and watching news simultaneously, or people who have a different native language than the audio language.
Searchability – The content is searchable by search engines. Whereas videos can’t be searched optimally by most search engines, search engines can use the caption text files and make your content more discoverable.
Social courtesy – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.
Comprehension – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.

Solution overview

The library presented in this post uses Amazon Polly to generate sound and closed captions for an input text. You can easily integrate this library in your text-to-speech applications. It supports several audio formats, and captions in both VTT and SRT file formats, which are the most commonly used across the industry.

In this post, we focus on the PollyVTT() syntax and options, and offer a few examples that demonstrate how to use the Python SubtitleGeneratorForPolly to simultaneously generate synchronous audio and subtitle files for a given text input. The output audio file format can be PCM(wav), OGG, or MP3, and the subtitle file format can be VTT or SRT. Furthermore, SubtitleGeneratorForPolly supports all Amazon Polly synthesize_speech parameters and adds to the rich Amazon Polly feature set.

The polly-vtt library and its dependencies are available on GitHub.

Install and use the function

Before we look at some examples of using PollyVTT(), the function that powers SubtitleGeneratorForPolly, let’s look at the installation and syntax of it.

Install the library using the following code:

pip install

To run from the command line, you simply run polly-vtt:

Usage: polly-vtt [OPTIONS] BASE_FILENAME VOICE_ID OUTPUT_FORMAT TEXT

The following code shows your options:

--caption-format TEXT 'srt' or 'vtt'
--help Show this message and exit. 

BASE_FILENAME: Base filename for both the audio and caption files 
VOICE_ID: Polly voice to use (Case-sensitive)
OUTPUT_FORMAT: Amazon Polly output format: pcm, mp3, ogg_vorbis 
TEXT: Full text to be digitized 
Caption format: srt or vtt

Let’s look at a few examples now.

Example 1

This example generates a PCM audio file along with an SRT caption file for two simple sentences:

$ polly-vtt testfile Joanna pcm "this is a test. this is a second sentence." --caption-format srt 

testfile.wav written successfully.
testfile.wav.srt written successfully.
Total Audio Length: 0:00:03.017500 
# of Sentences: 2

Example 2

This example demonstrates how to use a paragraph of text as input. This generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. The following example creates six files for the given input text:

pcm_testfile.wav
pcm_testfile.wav.vtt
mp3_testfile.mp3
mp3_testfile.mp3.vtt
ogg_testfile.ogg
ogg_testfile.ogg.srt

See the following code:

from polly_vtt import PollyVTT 

text = "News content is shaped by its own unique characteristics. Sentences and paragraphs are usually short and highly in formative because writers have to compress information into a limited space. Depending on the theme, news articles may con tain relevant terminology, place names, abbreviations, people’s names, and quotes. Excellent news writing is clear, precis e, and avoids ambiguity. The writing is dynamic, especially in online articles, because content may get updated multiple times per day as new information becomes available." 

polly_vtt = PollyVTT() 

# pcm with VTT captions 
polly_vtt.generate( 
"pcm_testfile", 
Text=text, 
VoiceId="Joanna", 
OutputFormat="pcm", 
) 

# mp3 with VTT captions 
polly_vtt.generate( 
"mp3_testfile", 
Text=text, 
VoiceId="Joanna", 
OutputFormat="mp3", 
)
 
# ogg with SRT captions 
polly_vtt.generate( 
"ogg_testfile", 
"srt",
Text=text, 
VoiceId="Joanna", 
OutputFormat="ogg_vorbis", 
)

Example 3

In most cases, however, you want to pass the text as an input file. The following is a Python example of this, with the same output as the previous example:

from polly_vtt import PollyVTT
import os
import boto3
import json

polly_vtt = PollyVTT()

try:
	f=open("input.txt", "r")
	print("file is opened")
	polly_vtt.generate(
	"pcm_testfile",
	Text=f.read(),
	VoiceId="Joanna",
	OutputFormat="pcm",
	)
	f.close()
except:
	print("error occurred while converting to PCM")
print("end of file")

# mp3 with VTT captions
try:
	f=open("input.txt", "r")
	print("file is opened")
	polly_vtt.generate(
	"mp3_testfile",
	Text=f.read(),
	VoiceId="Joanna",
	OutputFormat="mp3",
	)
	f.close()
except:
	print("error occurred while converting to MP3")
print("end of file")

# ogg with SRT captions
try:
	f=open("input.txt", "r")
	print("file is opened")
	polly_vtt.generate(
	"ogg_testfile",
	"srt",
	Text=f.read(),
	VoiceId="Joanna",
	OutputFormat="ogg_vorbis",
	)
	f.close()
except:
	print("error occurred while converting to OGG")
print("end of file")

The following is a testimonial post from the AWS internal training team of using Amazon Polly with closed captions:

The following video offers a short demo of how the internal training team at AWS uses PollyVTT():

Conclusion

In this post, we shared a method to generate audio and subtitles at the same time for a given text. The PollyVTT() function and SubtitleGeneratorForPolly address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.

For more tutorials and information about Amazon Polly, check out the AWS Machine Learning Blog.

About the Authors

Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.

Dan McKee uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.

Orlando Karam is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.

AWS Machine Learning Blog