AWS for M&E Blog

Find video clips faster with Amazon Rekognition and AWS Elemental MediaConvert

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service.

News, entertainment, and daytime productions make frequent use of clips to create stories, to introduce guests, and for previews or highlights. Often there is a short time window for finding and selecting clips while a story or person is in the spotlight. This blog shows you how to find your video clips quickly from your archive using Amazon Rekognition and AWS Elemental MediaConvert.

Searching content archives to find a clip can be time consuming and difficult because indexing is almost always at the file level, where each file is a complete show. To find the perfect clip, you have to search for candidate files that might contain the person or visual you want, then review each entire file to find suitable clips. This is even more complex if you are looking for a combination of things, such as two celebrities on screen at the same time. All of this happens under time pressure, and customers tell us they sometimes end up buying a clip from another organization that they probably already own, simply because they could not find it quickly enough.

In this blog we outline how you can create a searchable index of video clips using Amazon Rekognition to identify segments and metadata, and AWS Elemental MediaConvert to clip the source file. By using that searchable index, you can find the right clips faster.

Solution

Creating the searchable index of clips requires three key steps:

1. Detect segments, labels and people using Amazon Rekognition Video
2. Index the metadata for each clip using Amazon Elasticsearch Service
3. Create individual proxy video clips from the main file using AWS Elemental MediaConvert

The first step uses Amazon Rekognition Video to detect labels, people, and segments. Amazon Rekognition Video makes it easy to add image and video analysis to your applications using proven, highly scalable, deep learning technology that requires no machine learning expertise to use. With Amazon Rekognition, you can identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content. This solution uses the Celebrity Recognition, Face Search, and Label Detection API calls in Amazon Rekognition to asynchronously detect celebrities, labels, and faces in videos.

The Amazon Rekognition Segment API uses machine learning (ML) to identify shot boundaries (camera shot changes) and technical cues such as end credits, black frames, and color bars in videos stored in an Amazon S3 bucket. Segment detection provides frame-accurate timecodes and supports SMPTE (drop frame and non-drop frame) timecodes. You get the start and end timecodes, and the duration, of each shot boundary and technical cue event. You can learn more about the video segmentation feature in this blog.

A shot detection is marked at the exact frame where there is a hard cut to a different camera. The diagram below illustrates shot detection segments on a strip of film. Note that each shot is identified by a cut from one camera angle or location to the next.

Shots diagram: Seven different images laid out in sequence, with each image representing a different shot in the style of a movie camera roll.

The second step in the solution uses Amazon Elasticsearch Service (Amazon ES). Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale. Writing metadata to Amazon ES for each clip, including the face and labels data from Amazon Rekognition, provides a searchable index of clips. You can search for cast members, visible objects in the clip or a combination of terms.

The final optional part uses AWS Elemental MediaConvert. AWS Elemental MediaConvert is a file-based video transcoding service with broadcast-grade features. It is widely used in content preparation and to create video-on-demand (VOD) content for broadcast and multiscreen delivery at scale. For this solution, you can use AWS Elemental MediaConvert to create proxies of each clip for fast browsing. AWS Elemental MediaConvert includes clipping and stitching features, enabling you to transcode just a specific segment from a longer video file to create a new clip.

This is a diagram of the high level workflow.

Workflow Diagram: Video file in S3 is processed by Amazon Rekognition. The output of segments, labels and faces is consumed by a Lambda function that creates the clip. This writes to Amazon Elasticsearch Service. AWS Elemental MediaConvert creates proxies. The user searches for clips in ES and views proxies.

We will now review how each step works in detail.

1. Detecting Segments, Labels and People

The Amazon Rekognition Segment API is an asynchronous operation that can be invoked for stored videos. For this solution you can use the Amazon Rekognition Shot Detection Demo's web interface; however, you could also use the CLI or an SDK for development languages such as Java and Python. You start shot detection with the StartSegmentDetection API call and then retrieve the results with the GetSegmentDetection API call. The Amazon Rekognition Segment API is a composite API that provides both technical cues and shot detection in the same API call; you can choose in the request whether to run one, the other, or both. Here is an example request for StartSegmentDetection that starts shot detection, notifies an SNS topic, and sets minimum confidence values to 80%.

{
  "Video": {
    "S3Object": {
      "Bucket": "{s3BucketName}",
      "Name": "{filenameAndExtension}"
    }
  },
  "NotificationChannel": {
    "RoleArn": "arn:aws:iam::{accountId}:role/{roleName}",
    "SNSTopicArn": "arn:aws:sns:{region}:{accountNumber}:{topicName}"
  },
  "SegmentTypes": ["SHOT", "TECHNICAL_CUE"],
  "Filters": {
    "ShotFilter": {
      "MinSegmentConfidence": 80.0
    },
    "TechnicalCueFilter": {
      "MinSegmentConfidence": 80.0
    }
  }
}
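
If you are working from code rather than the demo's web interface, you can make the same request with an AWS SDK. Here is a minimal sketch using boto3 (Python); the bucket, file, role, and topic names are placeholders for your own resources:

import boto3

rekognition = boto3.client("rekognition")

# Start segment detection; the bucket, file, role ARN, and SNS topic ARN are placeholders.
response = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "my-bucketname", "Name": "ToS-4k-1920.mov"}},
    NotificationChannel={
        "RoleArn": "arn:aws:iam::111122223333:role/RekognitionSNSRole",
        "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:SegmentDetectionTopic",
    },
    SegmentTypes=["SHOT", "TECHNICAL_CUE"],
    Filters={
        "ShotFilter": {"MinSegmentConfidence": 80.0},
        "TechnicalCueFilter": {"MinSegmentConfidence": 80.0},
    },
)

job_id = response["JobId"]  # keep this to retrieve results with GetSegmentDetection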

The response from StartSegmentDetection includes a JobId value, which is used to retrieve the results. Once the video analysis is complete, Amazon Rekognition Video publishes the completion status to the SNS topic, and you can then call the GetSegmentDetection API with that JobId. Here is an example request.

{
  "JobId": "1234456789d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca12345678",
  "MaxResults": 10,
  "NextToken": "wxyzWXYZMOGDhzBzYUhS5puM+g1IgezqFeYpv/H/+5noP/LmM57FitUAwSQ5D6G4AB/PNwolrw=="
}

The response includes a Segments section containing information about technical cues (black frames, color bars, and end credits) and shots, as in this cut-down example response.

{
    "JobStatus": "SUCCEEDED",
    "Segments": [
        {
            "Type": "SHOT",
            "StartTimestampMillis": 0,
            "EndTimestampMillis": 29041,
            "DurationMillis": 29041,
            "StartTimecodeSMPTE": "00:00:00:00",
            "EndTimecodeSMPTE": "00:00:29:01",
            "DurationSMPTE": "00:00:29:01",
            "ShotSegment": {
                "Index": 0,
                "Confidence": 87.50452423095703
            }
        }
    ]
}
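
GetSegmentDetection results are paginated, so a long video may need several calls. Here is a minimal sketch of a helper that pages through the results with boto3; it assumes the job has already completed, for example because the SNS completion notification has been received:

import boto3

rekognition = boto3.client("rekognition")

def get_all_segments(job_id):
    """Page through GetSegmentDetection and return every segment for a completed job."""
    segments = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            kwargs["NextToken"] = next_token
        response = rekognition.get_segment_detection(**kwargs)
        if response["JobStatus"] == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "Segment detection failed"))
        segments.extend(response.get("Segments", []))
        next_token = response.get("NextToken")
        if not next_token:
            return segments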

A key point is that the segment boundaries are frame accurate. In the example response above, the start and end times are given both as timestamps (milliseconds) and as timecodes (HH:MM:SS:FF). The other Rekognition APIs provide timestamps in milliseconds only.
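
If you only have a timestamp in milliseconds, you can derive a matching timecode yourself. Here is a minimal sketch that assumes a 24 fps non-drop-frame source, which is consistent with the example timecodes in this post; other frame rates, and drop-frame timecode in particular, need a different calculation:

def ms_to_timecode(ms, fps=24):
    """Convert milliseconds to an HH:MM:SS:FF timecode, assuming a non-drop-frame rate."""
    total_frames = round(ms * fps / 1000.0)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

print(ms_to_timecode(29041))  # 00:00:29:01, matching the example response above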

Let’s use the open source movie Tears of Steel as an example to illustrate the process. (IMDB: Tears of Steel). Images and videos used in this blog are courtesy of the Blender Foundation, shared under Creative Commons Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/) license. Tears of Steel: (CC) Blender Foundation | mango.blender.org (https://mango.blender.org/)

Example Shot - Image from movie "Tears of Steel". Two characters arguing.

Using Segment Detection, you can find all the shots and technical cues in the video file. Let’s use shot 4 as an example, where we see two of the characters arguing on a bridge in Amsterdam. You can see the start and end times in Timestamp and Timecode formats in the JSON output:

{
'Type': 'SHOT', 
'StartTimestampMillis': 25000, 
'EndTimestampMillis': 40208, 
'DurationMillis': 15208, 
'StartTimecodeSMPTE': '00:00:25:00', 
'EndTimecodeSMPTE': '00:00:40:05', 
'DurationSMPTE': '00:00:15:05', 
'ShotSegment': {'Index': 4, 'Confidence': 99.92575073242188}
}

Next, you can detect objects using StartLabelDetection and GetLabelDetection. This returns every label detected, the timestamp (in milliseconds) at which it was detected, and the confidence. The label detection results are reported at intervals and can be sorted by timestamp or by label. The results below are sorted by timestamp, with the first and last shown:

{
  "Labels": [
    {
      "Timestamp": 9512,
      "Label": {
        "Name": "Building",
        "Confidence": 83.13668060302734,
        "Instances": [],
        "Parents": []
      }
    },
 
	…
    {
      "Timestamp": 150511,
      "Label": {
        "Name": "Human",
        "Confidence": 97.71092987060547,
        "Instances": [],
        "Parents": []
      }
    }
  ],
    "LabelModelVersion": "2.0"
}
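
For reference, here is a minimal sketch of the calls that produce output like the above; the bucket and file name are placeholders, and, like GetSegmentDetection, GetLabelDetection paginates with NextToken, which is omitted here for brevity:

import boto3

rekognition = boto3.client("rekognition")

# Start label detection; only labels with at least 80% confidence are returned.
label_job = rekognition.start_label_detection(
    Video={"S3Object": {"Bucket": "my-bucketname", "Name": "ToS-4k-1920.mov"}},
    MinConfidence=80.0,
)

# After the job completes (for example, on the SNS notification), fetch labels sorted by timestamp.
label_response = rekognition.get_label_detection(
    JobId=label_job["JobId"],
    SortBy="TIMESTAMP",
)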

You need to know which labels are associated with each individual shot, and you can determine this by filtering the list of detected labels. The process looks like this:

  • transform the JSON to create a list with every timestamp plus the labels detected at those timestamps
  • filter the list by timestamp to include only those within the start and end times of the shot

In the example for Shot 4, this time range is from 25000 to 40208 milliseconds. These are some of the labels detected at each timestamp (milliseconds) within that range, with a confidence of 80 or higher; the first three and last three are shown:

25512, ['Apparel', 'Banister', 'Clothing', 'Coat', 'Handrail', 'Human', 'Person', 'Railing'],
26012, ['Apparel', 'Clothing', 'Coat', 'Human', 'Jacket', 'Person'],
26512, ['Apparel', 'Clothing', 'Coat', 'Human', 'Jacket', 'Person'],
...
39011, ['Apparel', 'Banister', 'Clothing', 'Handrail', 'Human', 'Person', 'Railing'],
39511, ['Apparel', 'Banister', 'Clothing', 'Coat', 'Handrail', 'Human', 'Person', 'Railing'],
40011, ['Apparel', 'Clothing'],

Some of the labels appear at more than one timestamp within the shot, so we can refine this to produce a set of unique labels for the shot:

['Apparel', 'Banister', 'Clothing', 'Coat', 'Handrail', 'Human', 'Jacket', 'Person', 'Railing']
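
A minimal sketch of that filtering step, assuming label_response holds the GetLabelDetection output from the earlier sketch and using the Shot 4 boundaries from the segment response:

shot_start_ms, shot_end_ms = 25000, 40208  # Shot 4 boundaries from GetSegmentDetection

# Keep only labels detected within the shot, then reduce them to a sorted set of unique names.
shot_labels = sorted({
    item["Label"]["Name"]
    for item in label_response["Labels"]
    if shot_start_ms <= item["Timestamp"] <= shot_end_ms
    and item["Label"]["Confidence"] >= 80.0
})

print(shot_labels)
# ['Apparel', 'Banister', 'Clothing', 'Coat', 'Handrail', 'Human', 'Jacket', 'Person', 'Railing']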

To recognize the cast members, you first need to create a Rekognition face collection and index their images before you can attempt to detect them. For this scene, you need to add images of the actors Denise Rebergen and Vanja Rukavina. Once your collection is updated, you can call StartFaceSearch and GetFaceSearch to find them in the video. You can also combine celebrity recognition and face search to build a “Who’s Who” for your media content.
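
A minimal sketch of that setup with boto3 follows; the collection name, bucket, and reference image files are placeholders, and the ExternalImageId given at indexing time is what GetFaceSearch returns later:

import boto3

rekognition = boto3.client("rekognition")

# One-time setup: create a collection and index one reference image per actor.
rekognition.create_collection(CollectionId="cast-members")

for image_file, actor in [("denise.jpg", "Denise_Rebergen"), ("vanja.jpg", "Vanja_Rukavina")]:
    rekognition.index_faces(
        CollectionId="cast-members",
        Image={"S3Object": {"Bucket": "my-bucketname", "Name": image_file}},
        ExternalImageId=actor,
        MaxFaces=1,
    )

# Search the stored video against the collection; results are retrieved with GetFaceSearch.
face_job = rekognition.start_face_search(
    Video={"S3Object": {"Bucket": "my-bucketname", "Name": "ToS-4k-1920.mov"}},
    CollectionId="cast-members",
    FaceMatchThreshold=80.0,
)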

You can filter the face results using the timestamps (milliseconds) within the start and end timestamps of the scene to detect the actors. Only part of the JSON result is shown:

...
{'Timestamp': 28250 ... 'ExternalImageId': 'Denise_Rebergen', 'Confidence': 100.0}
{'Timestamp': 28250 ... 'ExternalImageId': 'Vanja_Rukavina', 'Confidence': 100.0}
{'Timestamp': 28750 ... 'ExternalImageId': 'Denise_Rebergen', 'Confidence': 100.0}
{'Timestamp': 28750 ... 'ExternalImageId': 'Vanja_Rukavina', 'Confidence': 100.0}
...

This is filtered by timestamp to become:

...

28250, ['Denise_Rebergen', 'Vanja_Rukavina']

28750, ['Denise_Rebergen', 'Vanja_Rukavina']

...

Finding the unique IDs detected for this clip, we have:

['Denise_Rebergen', 'Vanja_Rukavina']

2. Indexing the Data

Now you have the start and end times, the labels, and the people identified in the shot. The next step is to combine the metadata that describes the example clip and index it in Amazon ES. Each clip is indexed as a separate document with its own ID. In this application it is more intuitive to search for actors than for ExternalImageId, so the detected faces are indexed as 'actors'. You could even extend this to include the dialog from Amazon Transcribe, allowing you to search for particular lines or phrases as well as actors and labels. For the metadata below, Amazon Rekognition provides all the inputs except type and title, which you can supply per your video titling and labeling policy:

document = {
    "type": "clip",
    "shot_index": "4",
    "title": "Tears of Steel",
    "start_time_ms": "25000",
    "end_time_ms": "40208",
    "start_time_timecode": "00:00:25:00",
    "end_time_timecode": "00:00:40:05",
    "actors": ["Denise_Rebergen", "Vanja_Rukavina"],
    "labels": ["Apparel", "Banister", "Clothing", "Coat", "Handrail", "Human", "Jacket", "Person", "Railing"]
}

es.index(index="clips", doc_type="doc", id="111111", body=document)

Once you have clips in the index, you can search for just those clips that feature both the characters Celia and Thom. The ability to search for multiple terms elevates the search capability and helps you narrow down your results quickly.

"query": {
    "bool": {
      "must": {
          "query_string": {
                "default_field": "actors",
                "query": "Denise_Rebergen"}
              },
      "filter": {
          "query_string": {
                "default_field": "actors",
                "query": "Vanja_Rukavina"}
          }
      }
}})

And your search will return all the clips with the highest ranked match first. You could also search for a combination of terms such as an actor and a label. There are many ways to search your index and Amazon ES supports a variety of search terms and options.
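
As a usage sketch, assuming es is the same Elasticsearch client used for indexing and search_body holds the query shown above, running the search and listing the matching clips looks like this:

# search_body is the JSON query body shown above.
response = es.search(index="clips", body=search_body)

for hit in response["hits"]["hits"]:
    clip = hit["_source"]
    print(clip["title"], clip["shot_index"], clip["start_time_timecode"], clip["end_time_timecode"])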

3. Create Individual Proxy Video Clips

The last part is creating proxies for browsing. Alongside your archive, you can create a proxy for each clip so that you can quickly retrieve just the clip you want to view without retrieving the whole asset. AWS Elemental MediaConvert supports input clipping and stitching, and you can pass each clip's start and end time to create a new proxy for each clip. To keep the storage size of the proxy small, you should select a lower resolution and bit rate than the master, using QVBR for maximum efficiency. Proxies are often 1/4 or 1/16 of the original resolution, so for example, a 1920×1080 asset could be transcoded to a 960×540 or 480×270 proxy.

The input uses timecode for frame-accurate clipping, so you enter the start and end times as HH:MM:SS:FF in the "InputClippings" element of the "Inputs" section:

{"Inputs":
[
...
"FileInput": "s3://my-bucketname/filename.mp4",
"InputClippings":[
{"StartTimecode":"00:00:25:00","EndTimecode":"00:00:40:05"}],
...
]}
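
A minimal sketch of submitting that clipping job with boto3 follows; the job template (which would hold the lower-resolution QVBR output settings), the IAM role, and the bucket are placeholders you would create in your own account:

import boto3

# MediaConvert uses an account-specific endpoint, so look it up first.
generic_client = boto3.client("mediaconvert")
endpoint = generic_client.describe_endpoints()["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoint)

# Create a proxy of Shot 4 only, by clipping the input to its start and end timecodes.
mediaconvert.create_job(
    Role="arn:aws:iam::111122223333:role/MediaConvertRole",       # placeholder role
    JobTemplate="clip-proxy-540p-qvbr",                            # placeholder template
    Settings={
        "Inputs": [
            {
                "FileInput": "s3://my-bucketname/ToS-4k-1920.mov",
                "TimecodeSource": "ZEROBASED",
                "InputClippings": [
                    {"StartTimecode": "00:00:25:00", "EndTimecode": "00:00:40:05"}
                ],
                "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
            }
        ]
    },
)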

Bonus Content: Creating an Edit Decision List

In day-to-day operations, as soon as you have found the right clip, you will want to edit the full-resolution copy. To make this simpler, you can transform the output of the Segment Detection API into an Edit Decision List (EDL).

An EDL is a file containing a list of timecodes and is used in the video editing process with software tools such as Adobe Premiere Pro or DaVinci Resolve. The timecodes in the EDL are used to break a large video into smaller segments and decide which segments to use, and the EDL can be shared with other editors working on the same project even if they are using different editing software. Shot 004 is shown in the listing below:

TITLE: TOS 4K 1920
FCM: NON-DROP FRAME

001 SHOT000 V C 00:00:00:00 00:00:08:21 00:00:00:00 00:00:08:21
FROM CLIP NAME: ToS-4k-1920.mov
002 SHOT001 V C 00:00:08:22 00:00:13:10 00:00:08:21 00:00:13:10
FROM CLIP NAME: ToS-4k-1920.mov
003 SHOT002 V C 00:00:13:11 00:00:18:09 00:00:13:10 00:00:18:09
FROM CLIP NAME: ToS-4k-1920.mov
004 SHOT003 V C 00:00:18:10 00:00:24:23 00:00:18:09 00:00:24:23
FROM CLIP NAME: ToS-4k-1920.mov
005 SHOT004 V C 00:00:25:00 00:00:40:05 00:00:24:23 00:00:40:05
FROM CLIP NAME: ToS-4k-1920.mov
...
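
If you want to generate a similar file in your own workflow rather than download it from the demo, here is a minimal sketch that turns the SHOT segments returned by GetSegmentDetection into CMX 3600-style events. For simplicity, the record columns here mirror the source timecodes, which differs slightly from the demo output above; get_all_segments and job_id are from the earlier sketches:

def segments_to_edl(segments, clip_name, title="TOS 4K 1920"):
    """Build a simple CMX 3600-style EDL string from SHOT segments."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    shots = [s for s in segments if s["Type"] == "SHOT"]
    for event_number, shot in enumerate(shots, start=1):
        start = shot["StartTimecodeSMPTE"]
        end = shot["EndTimecodeSMPTE"]
        lines.append(
            f"{event_number:03d}  SHOT{shot['ShotSegment']['Index']:03d}  V  C  "
            f"{start} {end} {start} {end}"
        )
        lines.append(f"FROM CLIP NAME: {clip_name}")
    return "\n".join(lines)

edl_text = segments_to_edl(get_all_segments(job_id), "ToS-4k-1920.mov")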

The Amazon Rekognition Shot Detection Demo includes short demo videos on how to import the EDL into popular video editing tools.

Video showing the editing workflow: starting with the Shot Detection Demo, downloading the Edit Decision List, importing the file into editing software, and seeing the different shots in a timeline view.

You can use the Amazon Rekognition Shot Detection Demo to follow the editing workflow from detecting segments, to downloading the Edit Decision List (EDL), to importing it into different video editing software packages. Deploy the AWS CloudFormation template in your account, then upload a video. Once the video has uploaded, click on it in the application; in addition to information about shot segments and technical cues, there is also a "Download EDL" button. Once downloaded, you can import this file into compatible video editing software, as shown in the video.

Conclusion and Next Steps

You can now create a searchable index of clips using Amazon Rekognition, Amazon ES and AWS Elemental MediaConvert. In this blog, you have learned how to:

1. Identify segments in a video file with frame accuracy using the SegmentDetection API
2. Use the timestamps to filter the metadata for each individual clip
3. Index the clips
4. Create individual proxies using InputClippings

As a bonus, you can now create an Edit Decision List from the same SegmentDetection response to drive your editing software.

You can add these features to enhance your existing applications, or start with Media2Cloud as a reference architecture and build your own. For a visual demonstration of segment detection in action with some of your own files, please try out this shot detection demo on GitHub.

Further Reading

Please see the AWS Media2Cloud solution and the AWS Media Insights Engine as starting points for building your own application. A simple way to search at file level is also provided by the AWS Media Analysis Solution.

To learn more about AWS Elemental Media Services, AWS Training and Certification has created the Media Services Learning Path with free online training including how to use AWS Elemental MediaConvert for clipping and stitching.