AWS Machine Learning Blog
Generate synchronized closed captions and audio using the Amazon Polly subtitle generator
Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
As our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.
Although subtitles and captions are often used interchangeably, including in this post, there are subtle differences between them:
- Subtitles – The text displayed on the screen is in a different language from the audio, and nothing is displayed for non-dialogue audio such as significant sounds. The primary objective is to reach an audience that doesn’t speak the language of the audio.
- Captions (closed/open) – Captions display the dialogue spoken in the audio, in the same language. Their primary purpose is to increase accessibility in cases where the end consumer can’t hear the audio, for any of a range of reasons. Closed captions are stored in a separate file from the audio/video source and can be turned on and off at the user’s discretion, whereas open captions are part of the video file and can’t be turned off by the user.
Benefits of using Amazon Polly to generate audio with subtitles or closed captions
Imagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when Amazon Polly generates your audio, captions are included that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.
There are a multitude of use cases for captions, such as advertisements in social spaces, gymnasiums, coffee shops, and other places where typically there is something on a television with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing co-passengers; and several more.
Irrespective of the field of application, closed captioning can help with the following:
- Accessibility – People with hearing impairments can better consume your content.
- Retention – Online learning is easier for e-learners to grasp and retain when more human senses are involved.
- Reachability – Your content can reach people who have competing priorities, such as gaming and watching news simultaneously, or people whose native language differs from the audio language.
- Searchability – The content is searchable by search engines. Whereas videos can’t be searched optimally by most search engines, search engines can use the caption text files and make your content more discoverable.
- Social courtesy – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.
- Comprehension – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.
Solution overview
The library presented in this post uses Amazon Polly to generate sound and closed captions for an input text. You can easily integrate this library in your text-to-speech applications. It supports several audio formats, and captions in both VTT and SRT file formats, which are the most commonly used across the industry.
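To make the difference between the two caption formats concrete, here is a minimal sketch that renders the same timed cues as SRT and as VTT. It is not the library's own code; the names `format_timestamp`, `render_srt`, and `render_vtt` are hypothetical helpers chosen for illustration. The only format facts it relies on are that SRT numbers each cue and uses a comma before the milliseconds, while VTT starts with a `WEBVTT` header and uses a period.

```python
from typing import Iterable, Tuple

# A cue is (start_ms, end_ms, text). This tuple layout is an assumption
# made for this sketch, not a structure defined by the library.
Cue = Tuple[int, int, str]


def format_timestamp(ms: int, fmt: str) -> str:
    """Render milliseconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    if fmt not in ("srt", "vtt"):
        raise ValueError(f"unsupported caption format: {fmt}")
    if ms < 0:
        raise ValueError("timestamps must be non-negative")
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def render_srt(cues: Iterable[Cue]) -> str:
    """Build an SRT document: numbered cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, 'srt')} --> {format_timestamp(end, 'srt')}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def render_vtt(cues: Iterable[Cue]) -> str:
    """Build a WebVTT document: a WEBVTT header followed by the cues."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, 'vtt')} --> {format_timestamp(end, 'vtt')}"
        blocks.append(f"{timing}\n{text}\n")
    return "\n".join(blocks)
```

Calling `render_srt([(0, 1500, "Hello.")])` and `render_vtt([(0, 1500, "Hello.")])` on the same cue list yields two files that differ only in the header, the cue numbering, and the millisecond separator.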
In this post, we focus on the PollyVTT() syntax and options, and offer a few examples that demonstrate how to use the Python SubtitleGeneratorForPolly to simultaneously generate synchronized audio and subtitle files for a given text input. The output audio file format can be PCM (WAV), OGG, or MP3, and the subtitle file format can be VTT or SRT. Furthermore, SubtitleGeneratorForPolly supports all Amazon Polly synthesize_speech parameters and adds to the rich Amazon Polly feature set.
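One way to obtain caption timings from Amazon Polly is to request sentence-level speech marks through the same synthesize_speech call, using `OutputFormat="json"` and `SpeechMarkTypes=["sentence"]`. The sketch below shows that approach with boto3; whether the library works exactly this way internally is an assumption. The helper names (`fetch_sentence_marks`, `parse_speech_marks`, `marks_to_cues`) are hypothetical, and ending each cue where the next sentence begins is a simplification chosen for this sketch.

```python
import json
from typing import Dict, List, Tuple


def fetch_sentence_marks(text: str, voice_id: str = "Joanna") -> str:
    """Request sentence speech marks from Amazon Polly (needs AWS credentials)."""
    import boto3  # imported lazily so the pure helpers below work without it

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        OutputFormat="json",           # speech marks instead of audio
        SpeechMarkTypes=["sentence"],  # one mark per sentence
    )
    return response["AudioStream"].read().decode("utf-8")


def parse_speech_marks(payload: str) -> List[Dict]:
    """Parse newline-delimited JSON speech marks, keeping sentence marks only."""
    marks = [json.loads(line) for line in payload.splitlines() if line.strip()]
    return [mark for mark in marks if mark.get("type") == "sentence"]


def marks_to_cues(marks: List[Dict], audio_duration_ms: int) -> List[Tuple[int, int, str]]:
    """Turn sentence marks into (start_ms, end_ms, text) cues.

    Assumption for this sketch: a cue ends where the next sentence starts,
    and the last cue ends when the audio ends.
    """
    cues = []
    for i, mark in enumerate(marks):
        start = mark["time"]
        end = marks[i + 1]["time"] if i + 1 < len(marks) else audio_duration_ms
        cues.append((start, max(start, end), mark["value"]))
    return cues
```

Because speech marks and audio come from separate requests, the audio itself would be synthesized with a second call that passes the same text and voice with an audio `OutputFormat`.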
The polly-vtt library and its dependencies are available on GitHub.
Install and use the function
Before we look at some examples of using PollyVTT(), the function that powers SubtitleGeneratorForPolly, let’s look at its installation and syntax.
Install the library using the following code:
To run from the command line, you simply run polly-vtt:
The following code shows your options:
Let’s look at a few examples now.
Example 1
This example generates a PCM audio file along with an SRT caption file for two simple sentences:
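As background for the PCM case, the following is a minimal sketch of handling raw PCM audio; it is not the library's own code. Amazon Polly returns PCM as signed 16-bit, mono, little-endian samples (16,000 Hz by default), so the data needs a WAV header before most players can open it, and its byte length gives the audio duration needed to close the last caption cue. The helper names `pcm_duration_ms` and `pcm_to_wav_bytes` are hypothetical.

```python
import io
import wave

SAMPLE_WIDTH_BYTES = 2  # Polly PCM is signed 16-bit
CHANNELS = 1            # Polly PCM is mono


def pcm_duration_ms(pcm: bytes, sample_rate: int = 16000) -> int:
    """Compute the duration of raw PCM audio from its byte length."""
    frames = len(pcm) // (SAMPLE_WIDTH_BYTES * CHANNELS)
    return frames * 1000 // sample_rate


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw PCM samples in a WAV container using the standard library."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

The `sample_rate` argument must match the `SampleRate` requested from Amazon Polly; otherwise both the playback speed and the computed duration are wrong, and the captions drift out of sync.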
Example 2
This example demonstrates how to use a paragraph of text as input. This generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. The following example creates six files for the given input text:
- pcm_testfile.wav
- pcm_testfile.wav.vtt
- mp3_testfile.mp3
- mp3_testfile.mp3.vtt
- ogg_testfile.ogg
- ogg_testfile.ogg.srt
See the following code:
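The naming pattern in the file list above can be sketched as follows. This is an illustration inferred from those file names, not the library's code: the audio file takes an extension based on the Amazon Polly output format, and the caption file appends the caption format to the full audio file name. The `pcm`, `mp3`, and `ogg_vorbis` keys are the audio values accepted by the synthesize_speech `OutputFormat` parameter; the `output_names` helper is hypothetical.

```python
from typing import Tuple

# Maps Amazon Polly OutputFormat values to the file extensions seen in the list.
AUDIO_EXTENSIONS = {"pcm": "wav", "mp3": "mp3", "ogg_vorbis": "ogg"}
CAPTION_FORMATS = ("srt", "vtt")


def output_names(base_name: str, output_format: str, caption_format: str) -> Tuple[str, str]:
    """Return (audio_filename, caption_filename) for one synthesis run."""
    if output_format not in AUDIO_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {output_format}")
    if caption_format not in CAPTION_FORMATS:
        raise ValueError(f"unsupported caption format: {caption_format}")
    audio_name = f"{base_name}.{AUDIO_EXTENSIONS[output_format]}"
    # The caption file keeps the audio extension, for example "x.wav.vtt".
    return audio_name, f"{audio_name}.{caption_format}"
```

Keeping the audio extension inside the caption file name makes it unambiguous which audio file a caption file belongs to when several formats share one directory.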
Example 3
In most cases, however, you want to pass the text as an input file. The following is a Python example of this, with the same output as the previous example:
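When the text comes from a file, it usually needs light preparation before synthesis. The sketch below is one possible approach and not the library's code: it reads a UTF-8 file, collapses line breaks and repeated whitespace, and splits long text at sentence boundaries so that each piece stays under a per-request size limit. The default of 3,000 characters reflects the documented synthesize_speech limit on billed characters at the time of writing; treat the exact number, and the helper names `load_text` and `split_into_chunks`, as assumptions.

```python
import re
from pathlib import Path
from typing import List


def load_text(path: str) -> str:
    """Read a UTF-8 text file and collapse all whitespace runs to single spaces."""
    raw = Path(path).read_text(encoding="utf-8")
    return re.sub(r"\s+", " ", raw).strip()


def split_into_chunks(text: str, max_chars: int = 3000) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized
    chunk; handling that case is left out of this sketch.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

If the text is split into several chunks, the caption timestamps of each later chunk have to be shifted by the total duration of the audio that precedes it.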
The following is a testimonial post from the AWS internal training team of using Amazon Polly with closed captions:
The following video offers a short demo of how the internal training team at AWS uses PollyVTT():
Conclusion
In this post, we shared a method to generate audio and subtitles at the same time for a given text. The PollyVTT() function and SubtitleGeneratorForPolly address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.
For more tutorials and information about Amazon Polly, check out the AWS Machine Learning Blog.
About the Authors
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.
Dan McKee uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.
Orlando Karam is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.