Easily build semantic image search using Amazon Titan
Digital publishers are continuously looking for ways to streamline and automate their media workflows so they can generate and publish new content as rapidly as possible without forgoing quality.
Adding images to capture the essence of text can improve the reading experience. Machine learning techniques can help you discover such images. “A striking image is one of the most effective ways to capture audiences’ attention and create engagement with your story—but it also has to make sense.”
The previous post discussed how you can use AWS machine learning (ML) services to help you find the best images to place alongside an article or TV synopsis without typing in keywords. In that post, you used Amazon Rekognition to extract metadata from an image and then used a text embedding model to generate a word embedding of the metadata that could later be used to help find the best images.
In this post, you see how you can use Amazon Titan foundation models to quickly understand an article and find the best images to accompany it. This time, you generate the embedding directly from the image.
A key concept in semantic search is embeddings. An embedding is a numerical representation of some input—an image, text, or both—in the form of a vector. When you have many vectors, you can measure the distance between them, and vectors that are close in distance are semantically similar or related.
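To make "close in distance" concrete, the following is a minimal Python sketch that computes cosine similarity between toy vectors. The vectors, values, and helper name are illustrative only and are not part of the solution code; real Titan embeddings have 1,024 dimensions by default.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return a value in [-1, 1]; values closer to 1 mean the inputs are more similar."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 4-dimensional embeddings for illustration only.
article = [0.12, -0.80, 0.45, 0.33]
image_a = [0.10, -0.75, 0.50, 0.30]   # semantically close to the article
image_b = [-0.60, 0.20, -0.10, 0.90]  # unrelated

print(cosine_similarity(article, image_a))  # higher score
print(cosine_similarity(article, image_b))  # lower score
```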
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API, along with a broad set of capabilities to help you build generative AI applications, simplifying development while maintaining privacy and security.
Amazon Titan has recently added a new embedding model to its collection, Titan Multimodal Embeddings. This new model can be used for multimodal search, recommendation systems, and other downstream applications.
Multimodal models can understand and analyze data in multiple modalities such as text, image, video, and audio. This new Amazon Titan model can accept text, images, or both. This means you use the same model to generate embeddings of images and text and use those embeddings to calculate how similar the two are.
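The following is a minimal sketch of how you might call the model through Amazon Bedrock with the AWS SDK for Python (Boto3). The model ID and request fields (inputText, inputImage, embeddingConfig) follow the Titan Multimodal Embeddings documentation at the time of writing; the Region, file name, and helper name are illustrative rather than the solution's actual code.

```python
import base64
import json

import boto3

# Bedrock runtime client; the Region is illustrative.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def titan_multimodal_embedding(text=None, image_path=None):
    """Return an embedding for text, an image, or both from the same model."""
    body = {"embeddingConfig": {"outputEmbeddingLength": 1024}}
    if text is not None:
        body["inputText"] = text
    if image_path is not None:
        # Images are passed to the model as base64-encoded bytes.
        with open(image_path, "rb") as f:
            body["inputImage"] = base64.b64encode(f.read()).decode("utf-8")

    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]


# The same model embeds text, an image, or both into the same vector space.
text_vector = titan_multimodal_embedding(text="Werner Vogels wearing a white scarf")
image_vector = titan_multimodal_embedding(image_path="werner.jpg")
```

Because text and image vectors share the same space, you can compare an article embedding directly against an image embedding using a distance measure such as the cosine similarity shown earlier.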
Overview of the solution
In the following screenshot, you can see how you can take a mini article, perform a search, and find images that resonate with the article. In this example, you take a sentence that describes Werner Vogels wearing white scarfs while traveling around India. The vector of the sentence is semantically related to the vectors of the images of Werner wearing a scarf, so those images are returned at the top of the search results.
At a high level, an image is uploaded to Amazon Simple Storage Service (Amazon S3) and its metadata is extracted, including an embedding of the image.
To extract textual metadata from the image, you use the celebrity recognition feature and the label detection feature in Amazon Rekognition. Amazon Rekognition automatically recognizes tens of thousands of well-known personalities in images and videos using ML. You use this feature to recognize any celebrities in the images and store this metadata in Amazon OpenSearch Service. Label detection finds objects and concepts in the image, such as in the preceding screenshot, where the label metadata is shown below the image.
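As a minimal sketch of this metadata extraction, the two Amazon Rekognition calls look roughly like the following; the bucket name, object key, and thresholds are illustrative, not the solution's actual values.

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# The uploaded image is referenced directly from Amazon S3.
image = {"S3Object": {"Bucket": "my-image-bucket", "Name": "uploads/werner.jpg"}}

# Celebrity recognition: the names of any well-known personalities in the image.
celebrities = [
    face["Name"]
    for face in rekognition.recognize_celebrities(Image=image)["CelebrityFaces"]
]

# Label detection: objects and concepts found in the image, such as "Person" or "Scarf".
labels = [
    label["Name"]
    for label in rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=80)["Labels"]
]
```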
You use the Titan Multimodal Embeddings model to generate an embedding of the image, which is also stored as searchable metadata.
All the metadata is then stored in OpenSearch Service for later search queries when you need to find an image or images.
The second part of the architecture is to submit an article to find these newly ingested images.
When the article is submitted, you need to extract and transform it into a search input for OpenSearch Service. You use Amazon Comprehend to detect any names in the text that could be potential celebrities. Because you will likely pick only one or two images to capture the essence of the article, you summarize it first; generating a summary of the text is a good way to make sure that the embedding captures the pertinent points of the story. For this, you use the Amazon Titan Text G1 – Express model with a prompt such as “Please provide a summary of the following text. Do not add any information that is not mentioned in the text below.” With the summarized article, you use the Amazon Titan Multimodal Embeddings model to generate an embedding of the summary. The embedding model also has a maximum input of 128 tokens, which makes summarizing the article even more important so that as much information as possible is captured in the embedding. In simple terms, a token is a single word, sub-word, or character.
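As a rough illustration of this step, the following sketch detects person names with Amazon Comprehend, summarizes the article with Titan Text G1 – Express, and then embeds the summary using the titan_multimodal_embedding helper sketched earlier. The model IDs and request fields follow the Bedrock documentation at the time of writing; the prompt wording, generation settings, and variable names are illustrative rather than the solution's actual code.

```python
import json

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def detect_person_names(article):
    """Use Amazon Comprehend entity detection to find potential celebrity names."""
    entities = comprehend.detect_entities(Text=article, LanguageCode="en")["Entities"]
    return [e["Text"] for e in entities if e["Type"] == "PERSON"]


def summarize(article):
    """Summarize the article with Amazon Titan Text G1 - Express."""
    prompt = (
        "Please provide a summary of the following text. "
        "Do not add any information that is not mentioned in the text below.\n\n"
        + article
    )
    response = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps(
            {
                "inputText": prompt,
                "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0},
            }
        ),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["results"][0]["outputText"]


article = "Werner Vogels loves wearing white scarfs as he travels around India."
names = detect_person_names(article)   # for example, ["Werner Vogels"]
summary = summarize(article)           # keeps the embedding input within the 128-token limit

# titan_multimodal_embedding is the helper sketched earlier in this post.
query_vector = titan_multimodal_embedding(text=summary)
```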
You then perform a search against OpenSearch Service with the names and the embedding from the article to retrieve images that are semantically similar and, if a celebrity was detected, feature that celebrity.
As a user, you’re just searching for images using an article as the input.
Walkthrough
The following diagram shows the architecture that delivers this use case.
The following steps walk through the sequence of actions (depicted in the diagram) that enable semantic image and celebrity search.
- You upload an image to an Amazon S3 bucket.
- Amazon EventBridge listens for this event and then initiates an AWS Step Functions workflow.
- The Step Functions workflow takes the Amazon S3 image details and runs three parallel actions:
- An API call to Amazon Rekognition DetectLabels to extract object metadata
- An API call to the Amazon Rekognition RecognizeCelebrities API to extract any known celebrities
- An AWS Lambda function resizes the image to the maximum dimensions accepted by the ML embedding model and generates an embedding directly from the image input.
- The Lambda function then inserts the image object metadata, celebrity names (if present), and the embedding as a k-NN vector into an OpenSearch Service index.
- Amazon S3 hosts a simple static website, distributed by Amazon CloudFront. The front-end user interface (UI) lets you authenticate with the application using Amazon Cognito so you can search for images.
- You submit an article or some text using the UI.
- Another Lambda function calls Amazon Comprehend to detect any names in the text as potential celebrities.
- The function then summarizes the text to get the pertinent points from the article using Titan Text G1 – Express.
- The function generates an embedding of the summarized article using the Amazon Titan Multimodal Embeddings model.
- The function then searches the OpenSearch Service image index for images matching the celebrity name, and for the k-nearest neighbors of the vector using cosine similarity via Exact k-NN with a scoring script (see the sketch after this list).
- Amazon CloudWatch and AWS X-Ray give you observability into the end-to-end workflow to alert you of any issues.
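The following is a minimal sketch of that search step, using the opensearch-py client and the OpenSearch exact k-NN scoring script with cosine similarity. The domain endpoint, index name (images), and field names (image_vector, celebrities) are illustrative rather than the solution's actual names, and authentication is omitted for brevity.

```python
from opensearchpy import OpenSearch

# Domain endpoint is illustrative; authentication details are omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "my-opensearch-domain-endpoint", "port": 443}],
    use_ssl=True,
)


def search_images(query_vector, celebrity=None, k=5):
    """Return the k images closest to the article embedding, filtered by celebrity if given."""
    # Restrict the candidate set to images of the detected celebrity, if there is one.
    inner_query = {"match": {"celebrities": celebrity}} if celebrity else {"match_all": {}}
    body = {
        "size": k,
        "query": {
            "script_score": {
                "query": inner_query,
                "script": {
                    "source": "knn_score",           # OpenSearch exact k-NN scoring script
                    "lang": "knn",
                    "params": {
                        "field": "image_vector",      # the k-NN vector field in the index
                        "query_value": query_vector,  # embedding of the summarized article
                        "space_type": "cosinesimil",  # cosine similarity
                    },
                },
            }
        },
    }
    response = client.search(index="images", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```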
The following figure shows you the visual workflow designer of the Step Functions workflow.
Here’s an example of an embedding:
The preceding array of numbers is what captures meaning from the text or image object in a form that you can perform calculations and functions against.
Embeddings have high dimensionality, from a few hundred to many thousands of dimensions. This model has a default dimensionality of 1,024; that is, the full array has 1,024 elements that capture the semantics of the given object.
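To show how that 1,024-dimension vector is stored alongside the Amazon Rekognition metadata during ingestion, here is a minimal sketch of creating the OpenSearch Service index and indexing one image document. The index name, field names, and document values are illustrative, not the solution's actual schema, and image_vector refers to the embedding produced in the earlier sketch.

```python
from opensearchpy import OpenSearch

# Domain endpoint is illustrative; authentication details are omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "my-opensearch-domain-endpoint", "port": 443}],
    use_ssl=True,
)

# Create an index with a 1,024-dimension k-NN vector field next to the Rekognition metadata.
# Approximate k-NN is not enabled because the solution queries with the exact k-NN
# scoring script instead.
client.indices.create(
    index="images",
    body={
        "mappings": {
            "properties": {
                "image_vector": {"type": "knn_vector", "dimension": 1024},
                "celebrities": {"type": "text"},
                "labels": {"type": "text"},
                "s3_key": {"type": "keyword"},
            }
        }
    },
)

# Index one image document: the Rekognition metadata plus the Titan embedding.
client.index(
    index="images",
    body={
        "s3_key": "uploads/werner.jpg",
        "celebrities": ["Werner Vogels"],
        "labels": ["Person", "Scarf", "Clothing"],
        "image_vector": image_vector,  # 1,024 floats from the Titan Multimodal Embeddings model
    },
)
```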
Multimodal embedding versus text embedding
We discuss two options for delivering semantic image search, where the main difference is how you generate the embeddings of the images. In our previous post, you generated an embedding from the textual metadata extracted using Amazon Rekognition. In this post, you use the Titan Multimodal Embeddings model to generate an embedding of the image directly.
Doing a quick test and running a query in the UI against the two approaches, you can see the results are noticeably different. The example query article is “Werner Vogels loves wearing white scarfs as he travels around India.”
The multimodal model scores the images with a scarf present higher. The word scarf is present in the submitted article, and the embedding has captured that.
In the UI, you can see the metadata extracted by Amazon Rekognition. This metadata doesn't include the word scarf, so it has missed some information from the image that the image embedding model appears to have captured; depending on the use case, the multimodal model might therefore have an advantage. With Amazon Rekognition, however, you can filter the objects detected in the image before creating an embedding, so there are other use cases where that approach might work better, depending on your desired outcome.
The following figure shows the results from the Amazon Titan Multimodal Embeddings model.
The following figure shows the results from the Amazon Titan text embedding model using the Amazon Rekognition extracted metadata to generate the embedding.
Prerequisites
For this walkthrough, you must have the following prerequisites:
- An AWS account
- AWS Serverless Application Model Command Line Interface (AWS SAM CLI)
  - The solution uses the AWS SAM CLI for deployment.
  - Make sure that you're using the latest version of the AWS SAM CLI.
- Docker
  - The solution uses the AWS SAM CLI option to build inside a container to avoid the need for local dependencies. You need Docker for this.
- Node
  - The front end for this solution is a React web application that can be run locally using Node.
- npm
  - Installing the packages required to run the web application locally, or to build it for remote deployment, requires npm.
Build and deploy the full stack application
- Clone the repository.
- Change directory into the newly cloned project.
- Run npm install to download all the packages required to run the application.
- Run the deploy script, which runs a series of scripts in sequence that perform a sam build and sam deploy, update configuration files, and then host the web application files in Amazon S3, ready to be served through Amazon CloudFront.
- One of the final outputs from the script is an Amazon CloudFront URL, which is how you will access the application. You must create a new user in the AWS Management Console to sign in with. Make a note of the URL to use later.
The following screenshot shows how the script has used AWS SAM to deploy your stack and has output an Amazon CloudFront URL you can use to access the application.
Create a new user to sign in to the application
- Go to the Amazon Cognito console and select your new User pool.
- Create a new user with a new password.
Sign in to and test the web application
- Find the Amazon CloudFront URL to get to the sign-in page. This is output in the final line, as shown in the preceding screenshot.
- Enter your new username and password combination to sign in.
- Upload some sample images using the UI.
- Choose Choose file and then choose Upload.
Note: You can also upload directly to the S3 bucket in bulk by adding files to the /uploads folder.
- Write or copy and paste an article and choose Submit to see whether the images are returned in the expected order.
Cleaning up
To avoid incurring unintended charges, delete the resources.
- Find the S3 bucket deployed with this solution and empty the bucket.
- Go to the CloudFormation console, choose the stack that you deployed through the deploy script mentioned previously, and delete the stack.
Conclusion
In this post, you saw how to use Amazon Rekognition, Amazon Comprehend, Amazon Bedrock, and OpenSearch Service to extract metadata from your images and then use ML techniques to automatically discover closely related content using celebrity and semantic search. This is particularly important within the publishing industry, where speed matters in getting fresh content out quickly and to multiple platforms.
As a next step, deploy the solution in your AWS account and upload some of your own images to test how semantic search can work for you. Let us know your feedback in the comments.
About the Authors
Mark Watkins is a Solutions Architect within the Media and Entertainment team, supporting his customers as they solve many data and ML problems. Away from professional life, he loves spending time with his family and watching his two little ones grow up.
Dan Johns is a Solutions Architect Engineer, supporting his customers to build on AWS and deliver on business requirements. Away from professional life, he loves reading, spending time with his family and automating tasks within their home.