Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Extracting valuable insights from massive datasets is essential for businesses striving to gain a competitive edge. Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. Amazon Athena provides interactive analytics service for analyzing the data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Amazon EMR provides a big data environment for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. These data processing and analytical services support Structured Query Language (SQL) to interact with the data.

Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the tables metadata, which is data about table schemas, relationships among the tables, and possible column values. Large language model (LLM)-based generative AI is a new technology trend for comprehending a large corpora of information and assisting with complex tasks. Can it also help write SQL queries? The answer is yes.

Generative AI models can translate natural language questions into valid SQL queries, a capability known as text-to-SQL generation. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata for writing accurate SQL query. In this post, we demonstrate the critical role of metadata in text-to-SQL generation through an example implemented for Amazon Athena using Amazon Bedrock. We discuss the challenges in maintaining the metadata as well as ways to overcome those challenges and enrich the metadata.

Solution overview

This post demonstrates text-to-SQL generation for Athena using an example implemented using Amazon Bedrock. We use Anthropic’s Claude 2.1 foundation model (FM) in Amazon Bedrock as the LLM. Amazon Bedrock models are invoked using Amazon SageMaker. Working examples are designed to demonstrate how various details included in the metadata influences the SQL generated by the model. These examples use synthetic datasets created in AWS Glue and Amazon S3. After we review the significance of these metadata details, we’ll delve into the challenges encountered in gathering the required level of metadata. Subsequently, we’ll explore strategies for overcoming these challenges.

The examples implemented in the workflow are illustrated in the following diagram.

Figure 1. The solution architecture and workflow.

The workflow follows the following sequence:

A user asks a text-based question which can be answered by querying relevant AWS Glue tables through Athena.
Table metadata is fetched from AWS Glue.
The tables’ metadata and SQL generating instructions are added to the prompt template. The Claude AI model is invoked by passing the prompt and the model parameters.
The Claude AI model translates the user intent (question) to SQL based on the instructions and tables’ metadata.
The generated Athena SQL query is run.
The generated Athena SQL query and the SQL query results are returned to the user.

Prerequisites

These prerequisites are given If you want to try this example yourself. You can skip this prerequisites section if you want to understand the example without implementing it. The example centers on invoking Amazon Bedrock models using SageMaker, so we need to set up a few resources in an AWS Account. The relevant CloudFormation template, Jupyter Notebooks, and details of launching the necessary AWS services are covered in this section. The CloudFormation template creates the SageMaker instance with the necessary S3 bucket and IAM role permissions to run AWS Glue commands, Athena SQL, and invoke Amazon Bedrock AI models. The two Jupyter Notebooks (0_create_tables_with_metadata.ipynb and 1_text-to-sql-for-athena.ipynb) provide working code snippets to create the necessary tables and generate the SQL using the Claude AI model on Amazon Bedrock.

Granting Anthropic’s Claude permissions on Amazon Bedrock

Have an AWS account and sign in using the AWS Management Console.
Change the AWS Region to US West (Oregon).
Navigate to the AWS Service Catalog console and choose Amazon Bedrock.
On the Amazon Bedrock console, choose Model Access in the navigation pane.
Choose Manage model access.
Select the Claude
Choose Request model access if you’re requesting the model access for the first time. Otherwise choose Save Changes.

Deploying the CloudFormation stack

After launching the CloudFormation stack:

On the Create stack page, choose Next
On the Specify stack details page, choose Next
On the Configure stack options page, choose Next
On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources
Choose Submit

Downloading Jupyter Notebooks to SageMaker

In the AWS Management Console, choose the name of the currently displayed Region and change it to US West (Oregon).
Navigate to the AWS Service Catalog console and choose Amazon SageMaker.
On the Amazon SageMaker console, choose Notebook in the navigation pane.
Choose Notebook instances.
Select the SageMakerNotebookInstance created by the texttosqlmetadata CloudFormation stack.
Under Actions, choose Open Jupyter
Navigate to Jupyter console, select New, and then choose Console.

Run the following Shell script commands in the console to copy the Jupyter Notebooks.

cd /home/ec2-user/SageMaker
BASE_S3_PATH="s3://aws-blogs-artifacts-public/artifacts/BDB-4265"
aws s3 cp "${BASE_S3_PATH}/0_create_tables_with_metadata.ipynb" ./
aws s3 cp "${BASE_S3_PATH}/1_text_to_sql_for_athena.ipynb" ./

Open each downloaded Notebook and update the values of the athena_results_bucket, aws_region, and athena_workgroup variables based on the outputs from the texttosqlmetadata CloudFormation

Solution implementation

If you want to try this example yourself, try the CloudFormation template provided in the previous section. In the subsequent sections, we will illustrate how each element of the metadata included in the prompt influences the SQL query generated by the model.

The steps in the 0_create_tables_with_metadata.ipynb Jupyter Notebook create Amazon S3 files with synthetic data for employee and department datasets, creates employee_dtls and department_dtls Glue tables pointing to those S3 buckets, and extracts the following metadata for these two tables.

CREATE EXTERNAL TABLE employee_dtls (
	id int COMMENT 'Employee id',
	name string COMMENT 'Employee name',
	age int COMMENT 'Employee age',
	dept_id int COMMENT 'Employee Departments ID',
	emp_category string COMMENT 'Employee category. Contains TEMP For temporary, PERM for permanent, CONTR for contractors ',
	location_id int COMMENT 'Location identifier of the Employee',
	joining_date date COMMENT 'Joining date of the Employee',
	CONSTRAINT pk_1 PRIMARY KEY  (id) ,
	CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)
) 
PARTITIONED BY (
	region_id string COMMENT 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'
);

CREATE EXTERNAL TABLE department_dtls (
	id int COMMENT 'Department id',
	name string COMMENT 'Department name',
	location_id int COMMENT 'Location identifier of the Department'
)

The metadata extracted in the previous step provides column descriptions. For the region_id partition column and emp_category column, the description provides possible values along with their meaning. The metadata also has foreign key constraint details. AWS Glue doesn’t provide a way to specify the primary key and foreign key constraints, so use custom keys in the AWS Glue table-level parameters as an alternative to gather primary key and foreign keys while creating the AWS Glue table.

# Define the table schema
employee_table_input = {
    'Name': employee_table_name,
    'PartitionKeys': [
        {'Name': 'region_id', 'Type': 'string', 'Comment': 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'}
    ],
    'StorageDescriptor': {
        'Columns': [
            {'Name': 'id', 'Type': 'int', 'Comment': 'Employee id'},
       …
        ],
        'Location': employee_s3_path,
     …
    'TableType': 'EXTERNAL_TABLE',
    'Parameters': {
        'classification': 'csv',
        'primary_key': 'CONSTRAINT pk_1 PRIMARY KEY  (id)',
        'foreign_key_1': 'CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)'          
    }
}

# Create the table
response = glue_client.create_table(DatabaseName=database_name, TableInput=employee_table_input)

The steps in the 1_text-to-sql-for-athena.ipynb Jupyter notebook create the following wrapper function to interact with Claude FM on Amazon Bedrock to generate SQL based on user-provided text wrapped up in a prompt. This function hard codes the model’s parameters and model ID for demonstrating the basic functionality.

def interactWithClaude(prompt):

    body = json.dumps(
        {
            "prompt": prompt,
            "max_tokens_to_sample": 2048,
            "temperature": 1,
            "top_k": 250,
            "top_p": 0.999,
            "stop_sequences": [],
        }
    )
    modelId = "anthropic.claude-v2"  
    accept = "application/json"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())
    response_text_claude = response_body.get("completion")
    return response_text_claude

Define the following set of instructions for generating Athena SQL query. These SQL generating instructions specify which compute engine the SQL query should run on and other instructions to guide the model in generating the SQL query. These instructions are included in the prompt sent to the Bedrock model.

athena_sql_generating_instructions = """
Read database schema inside the <database_schema></database_schema> tags which contains a list of table names and their schemas to do the following:
    1. Create a syntactically correct AWS Athena query to answer the question.
    2. For tables with partitions, include the filters on the relevant partition columns.
    3. Include only relevant columns for the given question.
    4. Use only the column names that are listed in the schema description. 
    5. Qualify column names with the table name.
    6. Avoid joins to a table if there is no column required from the table.
    7. Convert Strings to Date type while filtering on Date type columns
    8. Return the sql query inside the <SQL></SQL> tab.
"""

Define different prompt templates for demonstrating the importance of metadata in text-to-SQL generation. These templates have placeholders for SQL query generating instructions and tables metadata.

athena_prompt1 = """
Human:  You are an AWS Athena query expert whose output is a valid sql query. You are given the following Instructions for building the AWS Athena query.
<Instructions>
{instruction_dtls}
</Instructions>
        
Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
  id int,
  name string,
  age int ,
  dept_id int,
  emp_category string ,
  location_id int ,
  joining_date date
) PARTITIONED BY (
  region_id string
  )
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
  id int,
  name string ,
  location_id int 
)
</table_schema>
</database_schema>

Question: {question}

Assistant: 
"""

Generate the final prompt by passing the question and instruction details as arguments to the prompt template. Then, invoke the model.

question_asked = "List of permanent employees who work in North America and  joined after Jan 1 2024"
prompt_template_for_query_generate = PromptTemplate.from_template(athena_prompt1)
prompt_data_for_query_generate = prompt_template_for_query_generate.format(question=question_asked,instruction_dtls=athena_sql_generating_instructions)
llm_generated_response = interactWithClaude(prompt_data_for_query_generate)
print(llm_generated_response.replace("<sql>", "").replace("</sql>", " ")  )

The model generates the SQL query for the user question by using the instructions and table details provided in the prompt.

SELECT employee_dtls.id, employee_dtls.name, employee_dtls.age, employee_dtls.dept_id, employee_dtls.emp_category
FROM employee_dtls 
WHERE employee_dtls.region_id = 'NA' 
  AND employee_dtls.emp_category = 'permanent'
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Significance of prompts and metadata in text-to-SQL generation

Understanding the details of tables and the data they contain is essential for both human SQL experts and generative AI-based text-to-SQL generation. These details, collectively known as metadata, provide crucial context for writing SQL queries. For the text-to-SQL example implemented in the previous section, we used prompts to convey specific instructions and table metadata to the model, enabling it to perform user tasks effectively. A question arises on what level of details we need to include in the table metadata. To clarify this point, we asked the model to generate SQL query for the same question three times with different prompts each time.

Prompt with no metadata

For the first test, we used a basic prompt containing just the SQL generating instructions and no table metadata. The basic prompt helped the model generate a SQL query for the given question, but it’s not helpful because the model made assumptions about table names, column names, and literal values used in the filter expressions.

Question: List of permanent employees who work in North America and joined after January 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Question: {question}
Assistant:

SQL query generated:

SELECT emp.employee_id, emp.first_name, emp.last_name, emp.department_id
FROM employee emp
WHERE emp.contract = 'Permanent'
AND emp.region = 'North America'
AND CAST(emp.start_date AS  DATE) > CAST('2024-01-01' AS DATE)

Prompt with basic metadata

For solving the problem of assumed table names and column names, we added table metadata in DDL format in the second prompt. As a result, the model used the correct column names and data types and restricted the DATE casting to a literal string value. It got the SQL query syntactically correct, but one issue remains: the model assumed the literal values used in the filter expressions.

Question: List of permanent employees who work in North America and joined after January 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
  id int,
  name string,
  age int ,
  dept_id int,
  emp_category string ,
  location_id int ,
  joining_date date
) PARTITIONED BY (
  region_id string
  )
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
  id int,
  name string ,
  location_id int 
)
</table_schema>
</database_schema>

Question: {question}
Assistant:

SQL query generated:

SELECT employee_dtls.id, employee_dtls.name, employee_dtls.age, employee_dtls.dept_id, employee_dtls.emp_category
FROM employee_dtls 
WHERE employee_dtls.region_id = 'NA' 
  AND employee_dtls.emp_category = 'permanent'
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Prompt with enriched metadata

Now we need to figure out how to provide the possible values of a column to the model. One way could be including metadata in the column for low cardinality columns. So we added column descriptions along with possible values in the third prompt. As a result, the model included the correct literal values in the filter expressions and gave accurate SQL query.

Question: List of permanent employees who work in North America and joined after Jan 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
id int COMMENT 'Employee id',
name string COMMENT 'Employee name',
age int COMMENT 'Employee age',
dept_id int COMMENT 'Employee Departments ID',
emp_category string COMMENT 'Employee category. Contains TEMP For temporary, PERM for permanent, CONTR for contractors ',
location_id int COMMENT 'Location identifier of the Employee',
joining_date date  COMMENT 'Employee joining date',
CONSTRAINT pk_1 PRIMARY KEY  (id) ,
CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)
) PARTITIONED BY (
region_id string COMMENT 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'
)
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
id int COMMENT 'Department id',
name string COMMENT 'Department name',
location_id int COMMENT 'Location identifier of the Department'
)
</table_schema>
</database_schema>

Question: {question}
Assistant:

SQL query generated:

SELECT employee_dtls.id, 
employee_dtls.name
FROM employee_dtls 
WHERE employee_dtls.emp_category = 'PERM'
  AND employee_dtls.region_id = 'AMER' 
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Prompt with foreign key constraints in the Metadata

Note that when we added the finer details to the metadata of the third prompt, we included foreign key constraints as well. This is done to help the model generate SQL for advanced queries that require joins. Adding foreign key constraints to the metadata helps the model identify the correct columns to be used in the join conditions. To demonstrate this point, we asked the model to write SQL for showing department details along with the employee details. For showing the department details, we need the department_dtls table. The model added department_dtls table to the SQL query and identified the right columns for the join condition based on foreign key constraint details included in the metadata.

Question: List of permanent employees who work in North America and joined after Jan 1, 2024.

SQL query generated:

SELECT
  employee_dtls.name AS employee_name,
  employee_dtls.age,
  department_dtls.name AS department_name
FROM employee_dtls 
JOIN department_dtls 
  ON employee_dtls.dept_id = department_dtls.id
WHERE 
  employee_dtls.emp_category = 'PERM'
  AND employee_dtls.region_id = 'AMER' 
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Additional observations

Though the model included relevant employee attributes in the SELECT clause, the exact list of attributes it included varied each time. Even for the same prompt definition, the model provided a varying list of attributes. The model randomly used one of the two approaches for casting the string literal value to date type. The first approach uses CAST('2024-01-01' AS DATE) and the second approach uses DATE '2024-01-01'.

Challenges in maintaining the metadata

Now that you understand how maintaining detailed metadata along with foreign key constraints helps the model in generating accurate SQL queries, let’s discuss how you can gather the necessary details of table metadata. The data lake and database catalogs support gathering and querying metadata, including table and column descriptions. However, making sure that these descriptions are accurate and up-to-date poses several practical challenges, such as:

Creating database objects with useful descriptions requires collaboration between technical and business teams to write detailed and meaningful descriptions. As tables undergo schema changes, updating metadata for each change can be time-consuming and requires effort.
Maintaining lists of possible values for the columns requires continuous updates.
Adding data transformation details to metadata can be challenging because of the dispersed nature of this information across data processing pipelines, making it difficult to extract and incorporate into table-level metadata.
Adding data lineage details to metadata faces challenges because of the fragmented nature of this information across data processing pipelines, making extraction and integration into table-level metadata complex.

Specific to the AWS Glue Data Catalog, more challenges arise, such as the following:

Creating AWS Glue tables through crawlers doesn’t automatically generate table or column descriptions, requiring manual updates to table definitions from the AWS Glue console.
Unlike traditional relational databases, AWS Glue tables don’t explicitly define or enforce primary keys or foreign keys. AWS Glue tables operate on a schema-on-read basis, where the schema is inferred from the data when querying. Therefore, there’s no direct support for specifying primary keys, foreign keys, or column descriptions in AWS Glue tables like there is in traditional databases.

Enriching the metadata

Listed here some ways that you can overcome the previously mentioned challenges in maintaining the metadata.

Enhance the table and column descriptions: Documenting table and column descriptions requires a good understanding of the business process, terminology, acronyms, and domain knowledge. The following are the different methods you can use to get these table and column descriptions into the AWS Glue Data Catalog.
- Use generative AI to generate better documentation: Enterprises often document their business processes, terminologies, and acronyms and make them accessible through company-specific portals. By following naming conventions for tables and columns, consistency in object names can be achieved, making them more relatable to business terminology and acronyms. Using generative AI models on Amazon Bedrock, you can enhance table and column descriptions by feeding the models with business terminology and acronym definitions along with the database schema objects. This approach reduces the time and effort required to generate detailed descriptions. The recently released metadata feature in Amazon DataZone, AI recommendations for descriptions in Amazon DataZone, is along these principles. After you generate the descriptions, you can update the column descriptions using any of the following options.
  - From the AWS Glue catalog UI
  - Using the AWS Glue SDK similar to Step 3a : Create employee_dtls Glue table for querying from Athena in the 0_create_tables_with_metadata.ipynb Jupyter Notebook
  - Add the COMMENTS in the DDL script of the table.
```
CREATE EXTERNAL TABLE <table_name> 
( column1 string COMMENT '<column_description>' ) 
PARTITIONED BY ( column2 string COMMENT '<column_description>' )
```

For AWS Glue tables cataloged from other databases:
- You can add table and column descriptions from the source databases using the crawler in AWS Glue.
- You can configure the EnableAdditionalMetadata Crawler option to crawl metadata such as comments and raw data types from the underlying data sources. The AWS Glue crawler will then populate the additional metadata in AWS Glue Data Catalog. This provides a way to document your tables and columns directly from the metadata defined in the underlying database.
Enhance the metadata with data profiling: As demonstrated in the previous section, providing the list of values in the employee category column and their meaning helped in generating the SQL query with more accurate filter conditions. We can provide such a list of values or data characteristics in the column descriptions with the help of data profiling. Data profiling is the process of analyzing and understanding the data and its characteristics as distinct values. By using data profiling insights, we can enhance column descriptions.
Enhance the metadata with details of partitions and a range of partition values: As demonstrated in the previous section, providing the list of partition values and their meaning in the partition column description helped in generating the SQL with more accurate filter conditions. For list partitions, we can add the list of the partition values and their meanings to the partition column description. For range partitions, we can add more context on the grain of the values like daily, monthly, and a specific range of values to the column description.

Enriching the prompt

You can enhance the prompts with query optimization rules like partition pruning. In the athena_sql_generating_instructions, defined as part of the 1_text-to-sql-for-athena.ipynb Jupyter Notebook, we added an instruction “For tables with partitions, include the filters on the relevant partition columns”. This instruction guides the model on how to handle partition pruning. In the example, we observed that the model added the relevant partition filter on the region_id partition column. These partition filters will speed up the SQL query execution and is one of the top query optimization techniques. You can add more such query optimization rules to the instructions. You can enhance these instructions with relevant SQL examples.

Cleanup

To clean up the resources, start by cleaning up the S3 bucket that was created by the CloudFormation stack. Then delete the CloudFormation stack using the following steps.

In the AWS Management Console, choose the name of the currently displayed Region and change it to US West (Oregon).
Navigate to AWS CloudFormation.
Choose Stacks.
Select texttosqlmetadata
Choose Delete.

Conclusion

The example presented in the post highlights the importance of enriched metadata in generating accurate SQL query using the text-to-SQL capabilities of Anthropic’s Claude model on Amazon Bedrock and discusses multiple ways to enrich the metadata. Amazon Bedrock is at the center of this text-to-SQL generation. Amazon Bedrock can help you build various generative AI applications including the metadata generation use case mentioned in the previous section. To get started with Amazon Bedrock, we recommend following the quick start in the GitHub repo and familiarizing yourself with building generative AI applications. After getting familiar with generative AI applications, see the GitHub Text-to-SQL workshop to learn more text-to-SQL techniques. See Build a robust Text-to-SQL solution and Best practices for Text-to-SQL for the recommended architecture and best practices to follow while implementing text-to-SQL generation.

About the author

Naidu Rongali is a Big Data and ML engineer at Amazon. He designs and develops data processing solutions for data intensive analytical systems supporting Amazon retail business. He has been working on integrating generative AI capabilities into the data lake and data warehouse systems using Amazon Bedrock AI models. Naidu has a PG diploma in Applied Statistics from the Indian Statistical Institute, Calcutta and BTech in Electrical and Electronics from NIT, Warangal. Outside of his work, Naidu practices yoga and goes trekking often.

AWS Big Data Blog

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Solution overview

Prerequisites

Granting Anthropic’s Claude permissions on Amazon Bedrock

Deploying the CloudFormation stack

Downloading Jupyter Notebooks to SageMaker

Solution implementation

Significance of prompts and metadata in text-to-SQL generation

Prompt with no metadata

Prompt with basic metadata

Prompt with enriched metadata

Prompt with foreign key constraints in the Metadata

Additional observations

Challenges in maintaining the metadata

Enriching the metadata

Enriching the prompt

Cleanup

Conclusion

About the author

Resources

Follow