AWS Machine Learning Blog

Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra

Updated May 2023 (v0.3.0): With this release MediaSearch indexer now supports indexing YouTube media and the MediaSearch Finder is enhanced to support playing YouTube videos inline.

The demand for audio and video media content is growing at an unprecedented rate. Organizations are using media to engage with their audiences like never before. Product documentation is increasingly published in video form, and podcasts are increasingly produced in place of blog posts. A number of content creators also publish their content on YouTube to reach a wider audience. The recent explosion in the use of virtual workplaces has resulted in content that is encapsulated in the form of recorded meetings, calls, and voicemails. Contact centers also generate media content such as support calls, screen share recordings, or post-call surveys.

Amazon Machine Learning services help you find answers and extract valuable insights from the content of your audio and video files as well as your text files.

In this post, we introduce a new open-source solution, MediaSearch, built to make your media files searchable, and consumable in search results. It uses Amazon Transcribe to convert media audio tracks to text, and Amazon Kendra to provide intelligent search. Your users can find the content they’re looking for, even when it’s embedded in the sound track of your audio or video files. The solution also provides an enhanced Amazon Kendra query application that lets users play the relevant section of original media files or YouTube video, directly from the search results page.

Solution overview

MediaSearch is easy to install and try out! Use it to enable your customers to find answers to their questions from your podcast recordings and presentations, or for your students to find answers from your educational videos or lecture recordings, in addition to text documents.

The MediaSearch solution has two components, as illustrated in the following diagram.

The first component, the MediaSearch Indexer finds and transcribes audio and video files stored in an Amazon Simple Storage Service (Amazon S3) bucket. The Indexer can also index YouTube videos from a YouTube playlist as audio files and transcribe these audio files. It prepares the transcriptions by embedding time markers at the start of each sentence, and it indexes each prepared transcription in a new or existing Amazon Kendra index. The Indexer runs the first time when you install it, and subsequently runs on an interval that you specify, maintaining the index to reflect any new, modified, or deleted files.

The second component, the MediaSearch Finder, is a sample web search client that you use to search for content in your Amazon Kendra index. It has all the features of a standard Amazon Kendra search page, but it also includes in-line embedded media players in the search result, so you can not only see the relevant section of the transcript, but also play the corresponding section from the original media (i.e., audio files and video files in your Media Bucket or a YouTube video) without navigating away from the search page (see the following screenshot).

In the sections that follow, we discuss several topics:

  • How to deploy the solution to your AWS account
  • How to use it to index and search sample media files
  • How to use the solution with your own media files
  • How the solution works under the hood
  • The costs involved
  • How to monitor usage and troubleshoot problems
  • Options to customize and tune the solution
  • How to uninstall and clean up when you’re done experimenting

Deploy the MediaSearch solution

In this section, we walk through deploying the two solution components: the indexer and the finder. We use an AWS CloudFormation stack to deploy the necessary resources in the us-east-1 (N. Virginia) AWS Region.

The source code is available in our GitHub repository. Follow the directions in the README to deploy MediaSearch to additional Regions supported by Amazon Kendra.

Deploy the indexer component

To deploy the indexer component, complete the following steps:

  1. Choose Launch Stack:
  2. Change the stack name if required to ensure that it’s unique.
  3. For ExistingIndexId, leave blank to create a new Amazon Kendra index (Developer Edition), otherwise provide the IndexId (not the index name) for an existing index in your account and Region (Amazon Kendra Enterprise Edition should be used for production workloads).
  4. For MediaBucket and MediaFolderPrefix, you could use the defaults (if deploying the solution in us-east-1) initially to transcribe and index sample audio and video files. If deploying into another region ensure that you enter a bucket name where your media files are stored. You could leave this parameter blank to prevent indexing any media file in your MediaBucket.
  5. If you would like to index your YouTube videos, the PlayListURL parameter must be set to your YouTube playlist url. If set, the indexer will download a  number (equivalent to the parameter value of NumberOfYTVideos) of latest YouTube videos as audio(mp3) files to a new YTMediaBucket bucket from the playlist. i.e. If NumberOfYTVideos is set to 5, then 5 latest YouTube videos from the YouTube playlist are indexed. Any YouTube videos indexed prior,  are ignored. To prevent YouTube videos from being indexed, either set the NumberOfYTVideos parameter value to 0 or leave the PlayListURL parameter as blank.
  6. Use the default values for the other parameters for now.
  7. Select the acknowledgement check boxes, and choose Create stack.
  8. When the stack is created (after approximately 20 minutes), choose the Outputs tab, and copy the value of IndexId—you need it to deploy the finder component in the next step.

The newly installed indexer first downloads YouTube videos as audio(mp3) files to the YTMediaBucket if PlayListURL has a valid YouTube playlist url and creates metadata for downloaded audio files on the YTMediaBucket at MetadataFolderPrefix location. The Indexer then runs automatically to find, transcribe, and index the YouTube audio files and metadata files stored in YTMediaBucket and other audio and video files stored in the MediaBucket (if provided as input). If you have media files in multiple buckets, you can deploy multiple instances of the indexer, each with a unique stack name.

The indexer can be deployed any of following ways:

  1. Enter a value for the MediaBucket and set PlayListURL to a valid YouTube playlist: In this case the Indexer indexes media files in the MediaBucket and also transcribes and indexes YouTube video transcriptions.
  2. Leave MediaBucket as blank and set PlayListURL to a valid YouTube playlist: In this case the Indexer transcribes and indexes YouTube media only.
  3. Enter a value for the MediaBucket and leave the PlayListURL as blank: In this case the Indexer indexes media files in the MediaBucket only.
  4. If the MediaBucket is left blank and PlayListURL is left blank, then no media files are indexed.

Deploy the finder component

To deploy the finder web application component, complete the following steps:

  1. Choose Launch Stack:
  2. For IndexId, use the Amazon Kendra index copied from the MediaSearch indexer stack outputs.
  3. For MediaBucketNames, use the MediaBucketsUsed output from the Indexer cloudformation stack to allow the search page to access media files across YTMediaBucket and Mediabucket. If you have multiple Indexer stacks deployed, ensure that the parameter MediaBucketNames contains names of all buckets where media files reside.
  4. For AdminEmail, use a valid email address, or leave empty if you choose to not enable authentication.
  5. For EnableAccessTokens and EnableGuestUser, use the default value, false.
  6. When the stack is created (after approximately 5 minutes), choose the Outputs tab and use the link for MediaSearchFinderURL to open the new media search application page in your browser.

If you chose to enable authentication, you will receive an email at the email address you provided with the subject Welcome to Finder Web App. This email contains a generated temporary password that you can use to log in and create your own password. Your username is admin. Your new password must have a length of at least 8 characters, and contain upper-case and lower-case characters, plus numbers and special characters.

If the application isn’t ready when you first open the page, don’t worry! The initial application build and deployment (using AWS Amplify) takes about 10 minutes, so it will work when you try again a little later. If for any reason the application still doesn’t open, refer to the README in the GitHub repo for troubleshooting steps.

And that’s all there is to the deployment! Next, let’s run some search queries to see it in action.

Test with the sample media files

By the time the MediaSearch finder application is deployed and ready to use, the indexer should have completed processing the audio(mp3) files for the YouTube videos and sample media files (selected AWS Podcast episodes and AWS Knowledge center videos). You can now run your first MediaSearch query.

  1. Open the MediaSearch finder application in your browser as described in the previous section.
  2. In the query box, enter What is Control Tower ?

The query returns multiple results, sourced from the transcripts of the sample media files and the YouTube audio(mp3) files.

  1. Observe the time markers at the beginning of each sentence in the answer text. This indicates where the answer is to be found in the original media file. The time markers are appended to the media file to play the media file from a specific point in time.
  2. While deploying the indexer if you opted to make the Category field facetable, you will be able to apply filters on the Category field on the MediaFinder application. Observe the categories on the screenshot above. Selecting YouTube video filters the Results to show only YouTube videos.
  3. Use the embedded video player to play the original video inline. Observe that the media playback starts at the relevant section of the video based on the time marker.
  4. To play the video full screen in a new browser tab, use the Fullscreen menu option in the player, or choose the media file hyperlink shown above the answer text.
  5. Choose the video file hyperlink (right-click), copy the URL, and paste it into a text editor. If the media is an audio file for a YouTube video it looks something like the following:
https://www.youtube.com/watch?v=unFVfqj9cQ8&t=36.58s

If the media file is a non-YouTube audio file that resides in the MediaBucket then the URL looks like below:

https://mediasearchtest.s3.amazonaws.com/mediasamples/What_is_an_Interface_VPC_Endpoint_and_how_can_I_create_Interface_Endpoint_for_my_VPC_.mp4?AWSAccessKeyId=ASIAXMBGHMGZLSYWJHGD&Expires=1625526197&Signature=BYeOXOzT585ntoXLDoftkfS4dBU%3D&x-amz-security-token=.... #t=253.52

This is a presigned S3 URL that provides your browser with temporary read access to the media file referenced in the search result. Using presigned URLs means you don’t need to provide permanent public access to all of your indexed media files.

  1. Scroll down the page, and observe that some search results are from audio (MP3) files, some are from YouTube and some other are video (MP4) files on the MediaBucket.

You can mix and match media types in the same index. You could include other data source types as well, such as documents, webpages, and other file types supported by available Amazon Kendra data sources, and search across them all, allowing Amazon Kendra to find the best content to answer your query.

  1. Experiment with additional queries, such as What does a solutions architect do? or What is Kendra?, or try your own questions.

Index and search your own media files

To index media files stored in your own S3 bucket, replace the MediaBucket and MediaFolderPrefix parameters with your own bucket name and prefix when you install or update the indexer component stack, and modify the MediaBucketName parameter with your own bucket name when you install or update the finder component stack. Additionally you can replace the YouTube playlist (PlayListURL) with your own playlist url and update the indexer stack.

  1. Create a new MediaSearch indexer stack using an existing Amazon Kendra IndexId to add files stored in the new location. To deploy a new indexer, follow the directions in the Deploy the indexer component section in this post, but this time replace the defaults to specify the media bucket name and prefix for your own media files or replace the YouTube playlist URL with your own playlist url. Ensure that you comply with the YouTube Terms of Service.
  2. Alternatively, update an existing MediaSearch indexer stack to replace the previously indexed files with files from the new location or update the YouTube playlist url and/or the number of videos to download from the playlist:
    1. Select the stack on the CloudFormation console, choose Update, then Use current template, then Next.
    2. Modify the media bucket name and prefix parameter values as needed.
    3. Modify the YouTube Playlist URL and Number of YouTube Videos values as needed.
    4. Choose Next twice, select the acknowledgement check box, and choose Update stack.
  3. Update an existing MediaSearch finder stack to change bucket names or add additional bucket names to the MediaBucketNames parameter.

When the MediaSearch indexer stack is successfully created or updated, the indexer automatically finds, transcribes, and indexes the media files stored in your S3 bucket. When it’s complete, you can submit queries and find answers from the audio tracks of your own audio and video files.

You have the option to provide metadata for any or all of your media files. Use metadata to assign values to index attributes for sorting, filtering, and faceting your search results, or to specify access control lists to govern access to the files. Metadata files can be in the same S3 folder as your media files (default), or in a parallel folder structure specified by the optional indexer parameter MetadataFolderPrefix. For more information about how to create metadata files, see S3 document metadata.

You can also provide customized transcription options for any or all of your media files. This allows you to take full advantage of Amazon Transcribe features such as custom vocabularies, automatic content redaction, and custom language models. For more information, refer to the README in the GitHub repo.

How the MediaSearch solution works

Let’s take a quick look under the hood to see how the solution works, as illustrated in the following diagram.

The MediaSearch solution has an event-driven serverless computing architecture with the following steps:

  1. You provide an Amazon S3 bucket containing the audio and video files you want to index and search. This is also known as the MediaBucket. Leave this blank if you do not want to index media from your MediaBucket.
  2. You also provide your YouTube playlist URL and the number of videos to index from the YouTube playlist. Ensure that you comply with the YouTube Terms of Service. The YTIndexer will index the latest files from the YouTube playlist. i.e. if the number of videos is set to 5, then the YTIndexer will index the 5 latest videos in the playlist. Any YouTube video indexed prior is ignored from being indexed.
  3. An AWS Lambda function fetches the YouTube videos from the playlist as audio (mp3 files) into the YTMediaBucket and also creates a metadata file in the MetadataFolderPrefix location with metadata for the YouTube video (2). The YouTube videoid along with the related metadata are recorded in a DynamoDB table (YTMediaDDBQueueTable) (3).
  4. Amazon EventBridge generates events on a repeating interval (e.g. every 2 hours, every 6 hours, etc.) These events invoke the AWS Lambda function (S3CrawlLambdaFunction) (4).
  5. An AWS Lambda function is invoked initially when the CloudFormation stack is first deployed, and then subsequently by the scheduled events from Amazon EventBridge (4). The S3CrawlLambdaFunction crawls through the MediaBucket and the YTMediabucket and starts a Kendra data source sync job. The Lambda function lists all the supported media files (FLAC, MP3, MP4, Ogg, WebM, AMR, or WAV) and associated metadata and transcribe options stored in the user provided S3 bucket (1).
  6. Each new file is added to another Amazon DynamoDB tracking table and submitted to be transcribed by a Transcribe job. Any file that has been previously transcribed is submitted for transcription again only if it has been modified since it was previously transcribed, or if associated Transcribe options have been updated. The DynamoDB table (6) is updated to reflect the transcription status and last modified timestamp of each file. Any tracked files that no longer exist in the S3 bucket are removed from the DynamoDB table and from the Amazon Kendra index. If no new or updated files are discovered, the Amazon Kendra data source sync job is immediately stopped. The DynamoDB table holds a record for each media file with attributes to track transcription job names and status, and last modified timestamps.
  7. As each Transcribe job completes, EventBridge generates a Job Complete event, which invokes an instance of another Lambda function(S3JobCompletionLambdaFunction).
  8. The Lambda function processes the transcription job output, generating a modified transcription that has a time marker inserted at the start of each sentence. This modified transcription is indexed in Amazon Kendra, and the job status for the file is updated in the DynamoDB table. When the last file has been transcribed and indexed, the Amazon Kendra data source sync job is stopped.
  9. The index is populated and kept in sync with the transcriptions of all the media files in the S3 bucket monitored by the MediaSearch indexer component, integrated with any additional content from any other provisioned data sources. The media transcriptions are used by Amazon Kendra’s intelligent query processing, which allows users to find content and answers to their questions.
  10. The sample finder client application enhances users’ search experience by embedding an inline media player with each Amazon Kendra answer that is based on a transcribed media file. The client uses the time markers embedded in the transcript to start media playback at the relevant section of the original media file.
  11. Optionally a Cognito user pool is used to authenticate users and to support OpenID token based access control, as determined by the Authentication and access control parameters provided when the finder client application stack is deployed.

Estimate costs

In addition to Amazon S3 costs associated with storing your media, the MediaSearch solution incurs usage costs from Amazon Kendra and Transcribe. Additional minor (usually not significant) costs are incurred by the other services mentioned after free tier allowances have been used. For more information, see the pricing documentation for Amazon Kendra, Transcribe, Lambda, DynamoDB, and EventBridge.

Pricing example: Index the sample media files

Considering a sample dataset of 25 media files—5 audio files for YouTube videos, 10 audio podcast and 10 video files—containing a total of around 480 minutes or 29,000 seconds of audio.

If you don’t provide an existing Amazon Kendra IndexId when you install MediaSearch, a new Amazon Kendra Developer Edition index is automatically created for you so you can test the solution. After you use your free tier allowance (up to 750 hours in the first 30 days), the index costs $1.125 per hour.

Transcribe pricing is based on the number of seconds of audio transcribed, with a free tier allowance of 60 minutes of audio per month for the first 12 months. After the free tier is used, the cost is $0.00040 for each second of audio transcribed. If you’re no longer free tier eligible, the cost to transcribe the sample files is as follows:

  • Total seconds of audio = 29,000
  • Transcription price per second = $0.00040
  • Total cost for Transcribe = [number of seconds] x [cost per second] = 29,000 x $0.00040 = $11.60

Monitor and troubleshoot

To see the details of each media file transcript job, navigate to the Transcription jobs page on the Transcribe console.

Each media file is transcribed only once, unless the file is modified. Modified files are re-transcribed and re-indexed to reflect the changes.

Choose any transcription job to review the transcription and examine additional job details.

On the Indexes page of the Amazon Kendra console, choose the index used by MediaSearch to examine the index details.

Choose Data sources in the navigation pane to examine the MediaSearch indexer data source, and observe the data source sync run history. The data source syncs when the indexer runs every interval specified in the CloudFormation stack parameters when you deployed or last updated the solution.

If elected to index YouTube videos while deploying the MediaSearch Indexer stack, additional facets are created in the Kendra index to hold the metadata for YouTube videos. These can be seen by selecting the Facet definition in the navigation pane.

On the DynamoDB console, choose Tables in the navigation pane. Use your MediaSearch stack name as a filter to display the MediaSearch DynamoDB tables, and examine the items showing each indexed media file and corresponding status.

The tableMediaSearch-Indexer-YTMediaDDBQueueTable has one record for each YouTube videoid that is downloaded as an audio(mp3) file along with the metadata for the video like author, view count, video title etc.

The tableMediaSearch-Indexer-MediaDynamoTable has one record for each media file(including YouTube videos), and contains attributes with information about the file and its processing status.

On the Functions page of the Lambda console, use your MediaSearch-Indexer stack name as a filter to list the lambda functions that are part of the solution.

The MediaSearch-Indexer-YouTubeVideoIndexer function indexes and downloads YouTube videos if the CloudFormation stack parameter PlayListURL is set to a valid YouTube playlist.

The MediaSearch-Indexer-S3CrawlLambdaFunction crawls the YTMediaBucket and the MediaBucket for media files and initiates the transcription jobs for the media files.

Once the transcription jobs finishes, a completion event triggers the MediaSearch-Indexer-S3JobCompletionLambdaFunction which ingests the transcription into the Kendra index with any related metadata.

Choose any of the functions to examine the function details, including environment variables, source code, and more. Choose Monitor & View logs in CloudWatch to examine the output of each function invocation and troubleshoot any issues.

Customize and enhance the solution

You can fork the MediaSearch GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements!

The following are a few suggestions for features you might want to implement:

Clean up

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete the indexer and finder stacks that you deployed. This deletes all the resources, including any Amazon Kendra indexes that were created by deploying the solution. Pre-existing indexes aren’t deleted. However, media files that were indexed by the solution are removed from the pre-existing index when you delete the indexer stack.

Conclusion

The combination of Amazon Transcribe and Amazon Kendra enable a scalable, cost-effective solution to make your media files discoverable. You can use the content of your media files to find accurate answers to your users’ questions, whether they’re from text documents or media files, and consume them in their native format. In other words, this solution is a leap in bringing media files on par with text documents as containers of information.

The sample MediaSearch application is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. For expert assistance, AWS Professional Services and other Amazon partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or using the issues forum in the MediaSearch GitHub repository.


About the Authors

Bob StrahanBob Strahan is a Solutions Architecture Senior Manager in the Amazon Web Services Language AI Services team.

Abhinav JawadekarAbhinav Jawadekar is a Principal Solutions Architect at Amazon Web Services in the AWS Language AI Services team.

Roshan ThomasRoshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia and works closely with Enterprise customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS