AWS Database Blog

Embed textual data in Amazon RDS for SQL Server using Amazon Bedrock

In Part 1 of this post, we covered how Retrieval Augmented Generation (RAG) can be used to enhance responses in generative AI applications by combining domain-specific information with a foundation model (FM). However, we stayed focused on the semantic search aspect of the solution, assuming that our vector store was already built and fully populated. In this post, we explore how to generate vector embeddings on Wikipedia data stored in a SQL Server database hosted on Amazon Relational Database Service (Amazon RDS). We also use Amazon Bedrock to invoke the appropriate FM APIs and an Amazon SageMaker Jupyter Notebook to help us orchestrate the overall process.

Amazon RDS for SQL Server and Amazon Bedrock

Before we explore vector embeddings, let’s discuss two key Amazon Web Services (AWS) services in this solution: Amazon RDS for SQL Server and Amazon Bedrock.

Amazon RDS for SQL Server is a fully managed database service that simplifies the setup, operation, and scaling of SQL Server databases in the cloud. By using RDS for SQL Server, we can store and manage our sample Wikipedia data, including vector embeddings, and achieve high availability, durability, and performance.

Amazon Bedrock is a fully managed service that offers a choice of industry-leading FMs along with a broad set of capabilities that you need to build generative AI applications, simplifying development with security, privacy, and responsible AI. With the comprehensive capabilities of Amazon Bedrock, you can experiment with a variety of top FMs, customize them privately with your data using techniques such as fine-tuning and RAG, and create managed agents that execute complex business tasks, from booking travel and processing insurance claims to creating advertising campaigns and managing inventory, all without writing any code. Because Amazon Bedrock is serverless, you don't have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you're already familiar with. It supports a wide range of data sources, including RDS for SQL Server instances, and provides built-in capabilities for generating vector embeddings using state-of-the-art language models supported on AWS.

Integrate RDS for SQL Server and Amazon Bedrock

The first step is to establish a connection between the RDS for SQL Server instance and Amazon Bedrock. This integration facilitates access to the Wikipedia data stored in the RDS for SQL Server database using the capabilities of Amazon Bedrock, making subsequent analysis straightforward.

Solution overview

Tokenizing and vectorizing content data into vector embeddings can be accomplished through various approaches. Regardless of the vector data store employed, it’s the responsibility of an individual such as a data engineer, database developer, or data scientist to convert the original data into a compatible vector format.

In this example, we have a modest dataset sourced from Wikipedia, stored in a SQL Server database table consisting of 25,000 rows. The table comprises columns labeled id, title, text, and url. We want to incorporate a new column named content_vector to facilitate vector similarity searches. We run the code in a Jupyter notebook so you can see step-by-step instructions and the corresponding results from each step.

The following diagram shows the flow of steps in the solution.

Solution Architecture

The flow of the steps is as follows:

  1. Get raw data from RDS for SQL Server
  2. Generate vector embeddings of the raw data using Amazon Bedrock through Amazon SageMaker
  3. Store the embeddings into RDS for SQL Server
  4. Create a vector table and columnstore index
  5. Generate embeddings from the prompts
  6. Perform similarity search on RDS for SQL Server

Prerequisites

This post assumes that you’re familiar with navigating the AWS Management Console. For this example, you need the following resources and services enabled in your AWS account:

To use RDS for SQL Server as a vector data store to implement a generative AI use case that involves similarity search, follow the steps in Part 1. Data embedding takes approximately 90 minutes to complete.

Generate vector embeddings

Follow these steps to generate vector embeddings and perform a similarity search on RDS for SQL Server:

  1. In the SageMaker console, when the notebook status is In Service, choose Open JupyterLab. You will be redirected to the JupyterLab IDE, as shown in the following screenshot. Select the Terminal tile in the Launcher to open a terminal session.

Notebook Terminal

  2. Run the following code in the Linux terminal window to install the Microsoft Open Database Connectivity (ODBC) driver for SQL Server on Linux.
# RHEL 7 and Oracle Linux 7
curl https://packages.microsoft.com/config/rhel/7/prod.repo | sudo tee /etc/yum.repos.d/mssql-release.repo
sudo yum remove unixODBC-utf16 unixODBC-utf16-devel #to avoid conflicts
sudo ACCEPT_EULA=Y yum install -y msodbcsql18
# Optional: for bcp and sqlcmd
sudo ACCEPT_EULA=Y yum install -y mssql-tools18
echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc
source ~/.bashrc
# For unixODBC development headers
sudo yum install -y unixODBC-devel
  3. Upload the SageMaker notebook (.ipynb) file provided in the prerequisites section. Choose the Open JupyterLab link at the right of your SageMaker notebook instance. Choose the folder icon located in the navigation pane of your JupyterLab window. Select the upload files icon, browse to the Downloads folder, and select the file to upload. After the file is uploaded, you can continue to the embedding process.

Upload Notebook

  4. Install the prerequisites using the following pip command.
! pip install --no-build-isolation --force-reinstall \
"boto3>=1.33.6" \
"awscli>=1.31.6" \
"botocore>=1.33.6"
  5. Import the necessary modules.
import json
import os
import sys
import boto3
import botocore
import pandas as pd
import pyodbc
import time
import warnings
  6. Assign the start time and create the Amazon Bedrock runtime client. The start time is used at the end to calculate the total time taken for embedding.
start_time = time.time()
bedrock_runtime = boto3.client(service_name="bedrock-runtime")
  7. Connect to the database that holds the Wikipedia article source data provided in the prerequisites section. In the following steps, you select all article-related data into a DataFrame and then apply the embedding model to some of the columns.
from warnings import filterwarnings
filterwarnings("ignore", category=UserWarning, message='.*pandas only supports SQLAlchemy connectable.*')
cnxn = pyodbc.connect(f'DRIVER=ODBC Driver 18 for SQL Server;SERVER=<rds_endpoint>;PORT=1433;DATABASE=vector_db_wiki_v3;UID=<master user>;PWD=<password>;Encrypt=no')
  8. Read the data from Amazon RDS.
sql_query = pd.read_sql("SELECT  [id] ,[url],[title] ,[text] FROM [vector_db_wiki_v3].[dbo].[wikipedia_articles]", cnxn)
  9. Convert the data into a DataFrame format.
df = pd.DataFrame(sql_query, columns=['id', 'url', 'title', 'text', 'content', 'content_id'])
  10. Remove all single quotation marks from the columns as part of the data cleansing process. This minimizes syntax errors when you form the INSERT command later.
df['title'] = df['title'].astype(str).str.replace("[']", "", regex=True)
df['text'] = df['text'].astype(str).str.replace("[']", "", regex=True)
df['url'] = df['url'].astype(str).str.replace("[']", "", regex=True)
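As a quick aside, the `"[']"` pattern passed to `str.replace` is a regular-expression character class that matches a single-quote character. The following self-contained snippet (illustrative sample data, not the blog's dataset) shows the cleansing behavior:

```python
import pandas as pd

# "[']" is a regex character class; with regex=True, every single quote is removed
s = pd.Series(["O'Brien's article", "plain title"])
cleaned = s.astype(str).str.replace("[']", "", regex=True)
print(list(cleaned))  # ['OBriens article', 'plain title']
```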
  11. Fill the NaN columns with 0, set content_id to an integer type, and compute a text_length column (the character length of each article, which is used later to skip text that is too long to embed).
df = df.fillna(0)
df['content_id'] = df['content_id'].astype('int')
# Character length of each article; used later to skip overly long text
df['text_length'] = df['text'].astype(str).str.len()

The next section in the notebook embeds the data by calling Amazon Titan Text Embeddings. The execution takes approximately 90 minutes to complete. You can see the progress of the embedding as the code executes.

warnings.simplefilter(action='ignore', category=FutureWarning)
for i in range(len(df)):
    # For the amazon.titan-embed-text-v1 model, the maximum context length is 8,192 tokens.
    # One token holds about 3 or 4 characters.
    if df.loc[i, 'text_length'] > 20000:  # It is up to you to choose your own threshold.
        print("Skip this article. The text length exceeded 20000, the row index is: " + str(i) + ". The process continues ...")
        continue

    if (i > 0) and (i % 1000) == 0:
        print("Sleep for 60 seconds")
        # Pause to stay under the 300,000 tokens-per-minute quota of amazon.titan-embed-text-v1
        time.sleep(60)

    input_data = df.loc[i, 'text']
    body = json.dumps({"inputText": input_data})
    modelId = "amazon.titan-embed-text-v1"
    # modelId = "amazon.titan-embed-text-v2:0"
    accept = "application/json"
    contentType = "application/json"
    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )

    response_body = json.loads(response.get("body").read())
    embedding = response_body.get("embedding")
    json_embedding_result = json.dumps(embedding)
    df.loc[i, 'content'] = json_embedding_result
    df.loc[i, 'content_id'] = i

The progress of the embedding will look similar to the following figure. As shown in the figure, the script skips articles that have a text length of more than 20,000 characters.

Embedding
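Even with the periodic sleep, long runs can still hit the Amazon Bedrock per-minute token quota and receive throttling errors. The following is a minimal, hypothetical retry sketch (not part of the original notebook): `invoke_with_backoff` wraps any zero-argument callable, such as a lambda around `bedrock_runtime.invoke_model`, and retries with exponential backoff. In production you would catch `botocore.exceptions.ClientError` and inspect the error code for `ThrottlingException` rather than catching `Exception` broadly.

```python
import time

def invoke_with_backoff(invoke_fn, max_retries=5, base_delay=1.0):
    """Call invoke_fn(), retrying with exponential backoff when it raises.

    invoke_fn: a zero-argument callable, for example
        lambda: bedrock_runtime.invoke_model(body=body, modelId=modelId,
                                             accept=accept, contentType=contentType)
    Raises the last exception if all retries are exhausted.
    """
    for attempt in range(max_retries):
        try:
            return invoke_fn()
        except Exception:  # in practice: botocore ClientError with code ThrottlingException
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```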

  12. Convert the DataFrame into a CSV file, which will be stored locally on the SageMaker instance.
df.to_csv(r'wikipedia_articles_content_embedded_based_on_Amazon_Bedrock.csv', index=False, sep='\t', encoding='utf-8', header=True)
  13. Read the file back into a new DataFrame.
my_new_df = pd.read_csv(r'wikipedia_articles_content_embedded_based_on_Amazon_Bedrock.csv', sep='\t', encoding='utf-8')
  14. Delete the rows that were skipped during embedding, where the content and content_id columns are equal to 0.0 and 0.
my_new_df_clean = my_new_df.drop(my_new_df[(my_new_df['content'] == "0.0") & (my_new_df['content_id'] == 0)].index)
  15. Match the columns in the pandas DataFrame to the columns in the table and rename the DataFrame columns in place.
column_map = {
    'id': 'id',
    'url': 'url',
    'title': 'title',
    'text': 'text',
    'content': 'content_vector',
    'content_id': 'vector_id',
    'text_length': 'text_length',
}

# Rename pandas DataFrame columns
my_new_df_clean.rename(columns=column_map, inplace=True)
  16. You can use the following command to display the cleansed DataFrame. We show only one record instead of all of them.
df_final = my_new_df_clean[['id', 'url', 'title', 'text', 'content_vector', 'vector_id', 'text_length']].copy()
display(df_final[:1])
  17. Instead of inserting the embeddings back into the same table, in this example we create a new table using the following command.
cursor = cnxn.cursor()
sql_stm = '''
drop table if exists dbo.wikipedia_articles_embedding_bedrock;
CREATE TABLE [dbo].[wikipedia_articles_embedding_bedrock](
[id] [int] NOT NULL,
[url] [varchar](1000) NOT NULL,
[title] [varchar](1000) NOT NULL,
[text] [varchar](max) NOT NULL,
[content_vector] [varchar](max) NOT NULL,
[vector_id] [int] NOT NULL,
text_length int
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];
'''
cursor.execute(sql_stm)
cnxn.commit()
  18. Prepare the data for insertion and insert it into the new table by forming an INSERT statement for each row.
# Prepare data for insertion
data = df_final.to_dict(orient='records')

# Insert data into the database table. This loop takes a few minutes.
with cnxn.cursor() as cursor:
    for row in data:
        columns = ', '.join(row.keys())
        values = ', '.join(f"'{value}'" for value in row.values())
        query = f"INSERT INTO wikipedia_articles_embedding_bedrock ({columns}) VALUES ({values})"
        cursor.execute(query)
    cnxn.commit()
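Building the INSERT statement with f-strings is fragile: any quotation mark that survives cleansing breaks the SQL, and the approach is open to SQL injection. A safer pattern is a parameterized `executemany`, sketched below. So the snippet is self-contained and runnable, it uses an in-memory SQLite database as a stand-in for the RDS for SQL Server connection; pyodbc uses the same `?` placeholder style, so the same call shape applies to the blog's `cnxn`.

```python
import sqlite3

# Illustrative stand-in for the pyodbc connection (pyodbc also uses '?' placeholders)
cnxn_demo = sqlite3.connect(":memory:")
cursor = cnxn_demo.cursor()
cursor.execute(
    "CREATE TABLE wikipedia_articles_embedding_bedrock "
    "(id INTEGER, url TEXT, title TEXT, text TEXT, "
    "content_vector TEXT, vector_id INTEGER, text_length INTEGER)"
)

# Sample rows; in the notebook these would come from df_final.itertuples() or to_dict()
rows = [
    (1, "https://en.wikipedia.org/wiki/Example", "Example", "Body text", "[0.1, 0.2]", 1, 9),
    (2, "https://en.wikipedia.org/wiki/Sample", "Sample", "More text", "[0.3, 0.4]", 2, 9),
]
cursor.executemany(
    "INSERT INTO wikipedia_articles_embedding_bedrock "
    "(id, url, title, text, content_vector, vector_id, text_length) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)
cnxn_demo.commit()
print(cursor.execute("SELECT COUNT(*) FROM wikipedia_articles_embedding_bedrock").fetchone()[0])  # → 2
```

With parameter binding, the driver handles quoting, so the quote-stripping cleansing step becomes a matter of preference rather than a correctness requirement.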
  19. Create a vector table and a columnstore index for efficient vector similarity search.
cursor = cnxn.cursor()
sql_stm = '''
drop table if exists dbo.wikipedia_articles_content_vector;
with cte as
(
select
v.id as article_id,
cast(tv.[key] as int) as vector_value_id,
cast(tv.[value] as float) as vector_value
from
[dbo].[wikipedia_articles_embedding_bedrock] as v
cross apply
openjson(content_vector) tv
)
select
article_id,
vector_value_id,
vector_value
into
dbo.wikipedia_articles_content_vector
from
cte;
create clustered columnstore index ixc
on dbo.wikipedia_articles_content_vector;
'''
cursor.execute(sql_stm)
cnxn.commit()
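The similarity search in the next steps calls `dbo.Bedrock_SimilarContentArticles`, the user-defined function created in Part 1, which scores each article vector against the prompt vector. Conceptually, the score is cosine similarity, sketched here in plain Python as an illustration of the math (not the UDF itself):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score ~1.0; orthogonal vectors score ~0.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In SQL, the same dot product and norms are computed by joining the prompt's (vector_value_id, vector_value) pairs against the columnstore table created above, which is why the per-component layout matters.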
  20. To test the vector similarity search, create a prompt with the search string Warrior.
prompt_data = "Warrior"
  21. Using the prompt, connect to Amazon Bedrock to convert the string into an embedding using Amazon Titan Text Embeddings.
body = json.dumps({"inputText": prompt_data})

# Connect to an Amazon Bedrock embedding model
modelId = "amazon.titan-embed-text-v1"
#modelId = "amazon.titan-embed-text-v2:0" # amazon.titan-embed-text-v2:0  # amazon.titan-embed-text-v1
accept = "application/json"
contentType = "application/json"
response = bedrock_runtime.invoke_model(
body=body, modelId=modelId, accept=accept, contentType=contentType
)

response_body = json.loads(response.get("body").read())
embedding = response_body.get("embedding")
  22. Convert the text embedding into a JSON string.
json_query_result = json.dumps(embedding)
  23. Connect to RDS for SQL Server to perform the similarity search and display the result.
# Connect to a SQL Server database (RDS for SQL Server)
cnxn = pyodbc.connect(f'DRIVER=ODBC Driver 18 for SQL Server; SERVER=<rds_endpoint>;PORT=1433;DATABASE=vector_db_wiki_v3;UID=<master user>;PWD=<password>;Encrypt=no')

# Perform a Vector Similarity search here:
cursor = cnxn.cursor()
cursor.execute("select top 20 id, title, cosine_distance, url \
from dbo.Bedrock_SimilarContentArticles(?) as r order by cosine_distance desc", json_query_result)  # pass the prompt embedding (JSON string) into the UDF
results = cursor.fetchall()
for row in results:
    print(*row)

We recognize the limitations of the code, which is neither perfect nor efficient, but it serves to illustrate the challenges that can arise during even basic data cleansing and transformation tasks. This code is a quick test that uses an Amazon Bedrock embedding model to vectorize 25,000 Wikipedia articles and validates a vector similarity search within RDS for SQL Server. The execution time will vary based on your compute instance. The entire process, from vectorizing the 25,000 articles to inserting the vector data into SQL Server tables and creating the columnstore index, might take approximately 2 hours. We use the Amazon Bedrock embedding model to generate embeddings for the content column of the data.

You can choose from various pre-trained models, such as bert-base-uncased or roberta-base, but be aware that these haven't been tested for this solution. Embedding models such as these generate dense vector representations of your textual data. The embeddings capture the semantic relationships between words, phrases, and concepts, and with them you can perform a wide range of analyses and tasks.

Cleanup

Running this solution created a few AWS resources:

  • An RDS for SQL Server database instance
  • An Amazon SageMaker Notebook instance

If you don’t need these resources going forward, delete the RDS for SQL Server and SageMaker Notebook instances to avoid unnecessary charges by following the instructions in the provided URLs.

Conclusion

This post has provided a comprehensive overview of how to generate vector embeddings, from setting up the environment and generating embeddings to exploring their applications and advanced techniques. As the field of natural language processing continues to evolve, new techniques and models will emerge, further enhancing the ability to extract insights and knowledge from textual data. With the right tools and techniques, you can use this vast repository of knowledge to discover actionable insights that drive innovation and effective decision-making across various domains. Try out this solution and if you have any comments or questions, leave them in the comments section.


About the Authors

Joshua Jin is a database benchmark engineer at Amazon Web Services (AWS), where he specializes in meticulously evaluating the performance of databases.

Sudarshan Roy is a Senior Database Specialist Cloud Solution Architect with World Wide AWS Database Services Organization (WWSO). He has led large scale database migration and modernization engagements for enterprise customers and is passionate about solving complex migration challenges while moving database workloads to the AWS Cloud.

Barry Ooi is a Senior Database Specialist Solution Architect at AWS. His expertise is in designing, building, and implementing data platforms using cloud-native services for customers as part of their journey on AWS. His areas of interest include data analytics and visualization.