Generate synthetic data for evaluating RAG systems using Amazon Bedrock

Evaluating your Retrieval Augmented Generation (RAG) system to make sure it fulfils your business requirements is paramount before deploying it to production environments. However, this requires acquiring a high-quality dataset of real-world question-answer pairs, which can be a daunting task, especially in the early stages of development. This is where synthetic data generation comes into play. With Amazon Bedrock, you can generate synthetic datasets that mimic actual user queries, enabling you to evaluate your RAG system’s performance efficiently and at scale. With synthetic data, you can streamline the evaluation process and gain confidence in your system’s capabilities before unleashing it to the real world.

This post explains how to use Anthropic Claude on Amazon Bedrock to generate synthetic data for evaluating your RAG system. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Fundamentals of RAG evaluation

Before diving deep into how to evaluate a RAG application, let’s recap the basic building blocks of a naive RAG workflow, as shown in the following diagram.

Retrieval Augmented Generation

The workflow consists of the following steps:

In the ingestion step, which happens asynchronously, data is split into separate chunks. An embedding model is used to generate embeddings for each of the chunks, which are stored in a vector store.
When the user asks a question to the system, an embedding is generated from the questions and the top-k most relevant chunks are retrieved from the vector store.
The RAG model augments the user input by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the large language model (LLM). The augmented prompt allows the LLM to generate an accurate answer to user queries.
An LLM is prompted to formulate a helpful answer based on the user’s questions and the retrieved chunks.

Amazon Bedrock Knowledge Bases offers a streamlined approach to implement RAG on AWS, providing a fully managed solution for connecting FMs to custom data sources. To implement RAG using Amazon Bedrock Knowledge Bases, you begin by specifying the location of your data, typically in Amazon Simple Storage Service (Amazon S3), and selecting an embedding model to convert the data into vector embeddings. Amazon Bedrock then creates and manages a vector store in your account, typically using Amazon OpenSearch Serverless, handling the entire RAG workflow, including embedding creation, storage, management, and updates. You can use the RetrieveAndGenerate API for a straightforward implementation, which automatically retrieves relevant information from your knowledge base and generates responses using a specified FM. For more granular control, the Retrieve API is available, allowing you to build custom workflows by processing retrieved text chunks and developing your own orchestration for text generation. Additionally, Amazon Bedrock Knowledge Bases offers customization options, such as defining chunking strategies and selecting custom vector stores like Pinecone or Redis Enterprise Cloud.

A RAG application has many moving parts, and on your way to production you’ll need to make changes to various components of your system. Without a proper automated evaluation workflow, you won’t be able to measure the effect of these changes and will be operating blindly regarding the overall performance of your application.

To evaluate such a system properly, you need to collect an evaluation dataset of typical user questions and answers.

Moreover, you need to make sure you evaluate not only the generation part of the process but also the retrieval. An LLM without relevant retrieved context can’t answer the user’s question if the information wasn’t present in the training data. This holds true even if it has exceptional generation capabilities.

As such, a typical RAG evaluation dataset consists of the following minimum components:

A list of questions users will ask the RAG system
A list of corresponding answers to evaluate the generation step
The context or a list of contexts that contain the answer for each question to evaluate the retrieval

In an ideal world, you would take real user questions as a basis for evaluation. Although this is the optimal approach because it directly resembles end-user behavior, this is not always feasible, especially in the early stages of building a RAG system. As you progress, you should aim for incorporating real user questions into your evaluation set.

To learn more about how to evaluate a RAG application, see Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock.

Solution overview

We use a sample use case to illustrate the process by building an Amazon shareholder letter chatbot that allows business analysts to gain insights about the company’s strategy and performance over the past years.

For the use case, we use PDF files of Amazon’s shareholder letters as our knowledge base. These letters contain valuable information about the company’s operations, initiatives, and future plans. In a RAG implementation, the knowledge retriever might use a database that supports vector searches to dynamically look up relevant documents that serve as the knowledge source.

The following diagram illustrates the workflow to generate the synthetic dataset for our RAG system.

synthetic dataset generation workflow

The workflow includes the following steps:

Load the data from your data source.
Chunk the data as you would for your RAG application.
Generate relevant questions from each document.
Generate an answer by prompting an LLM.
Extract the relevant text that answers the question.
Evolve the question according to a specific style.
Filter questions and improve the dataset either using domain experts or LLMs using critique agents.

We use a model from the Anthropic’s Claude 3 model family to extract questions and answers from our knowledge source, but you can experiment with other LLMs as well. Amazon Bedrock makes this effortless by providing standardized API access to many FMs.

For the orchestration and automation steps in this process, we use LangChain. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

The next sections walk you through the most important parts of the process. If you want to dive deeper and run it yourself, refer to the notebook on GitHub.

Load and prepare the data

First, load the shareholder letters using LangChain’s PyPDFDirectoryLoader and use the RecursiveCharacterTextSplitter to split the PDF documents into chunks. The RecursiveCharacterTextSplitter divides the text into chunks of a specified size while trying to preserve context and meaning of the content. It’s a good way to start when working with text-based documents. You don’t have to split your documents to create your evaluation dataset if your LLM supports a context window that is large enough to fit your documents, but you could potentially end up with a lower quality of generated questions due to the larger size of the task. You want to have the LLM generate multiple questions per document in this case.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

# Load PDF documents from directory
loader = PyPDFDirectoryLoader("./synthetic_dataset_generation/")  
documents = loader.load()
# Use recursive character splitter, works better for this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Split documents into small chunks
    chunk_size = 1500,  
    # Overlap chunks to reduce cutting sentences in half
    chunk_overlap  = 100,
    separators=["\n\n", "\n", ".", " ", ""],
)

# Split loaded documents into chunks
docs = text_splitter.split_documents(documents)

To demonstrate the process of generating a corresponding question and answer and iteratively refining them, we use an example chunk from the loaded shareholder letters throughout this post:

“page_content=''Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In the\nfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. [...] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.\nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America and\nInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.\nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”

Generate an initial question

To facilitate prompting the LLM using Amazon Bedrock and LangChain, you first configure the inference parameters. To accurately extract more extensive contexts, set the max_tokens parameter to 4096, which corresponds to the maximum number of tokens the LLM will generate in its output. Additionally, define the temperature parameter as 0.2 because the goal is to generate responses that adhere to the specified rules while still allowing for a degree of creativity. This value differs for different use cases and can be determined by experimentation.

import boto3

from langchain_community.chat_models import BedrockChat

# set up a Bedrock-runtime client for inferencing large language models
boto3_bedrock = boto3.client('bedrock-runtime')
# Choosing claude 3 Haiku due to cost and performance efficiency
claude_3_haiku = "anthropic.claude-3-haiku-20240307-v1:0"
# Set-up langchain LLM for implementing the synthetic dataset generation logic

# for each model provider there are different parameters to define when inferencing against the model
inference_modifier = {
                        "max_tokens": 4096,
                        "temperature": 0.2
                    }
                                         
llm = BedrockChat(model_id = claude_3_haiku,
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

You use each generated chunk to create synthetic questions that mimic those a real user might ask. By prompting the LLM to analyze a portion of the shareholder letter data, you generate relevant questions based on the information presented in the context. We use the following sample prompt to generate a single question for a specific context. For simplicity, the prompt is hardcoded to generate a single question, but you can also instruct the LLM to generate multiple questions with a single prompt.

The rules can be adapted to better guide the LLM in generating questions that reflect the types of queries your users would pose, tailoring the approach to your specific use case.

# Create a prompt template to generate a question a end-user could have about a given context
initial_question_prompt_template = PromptTemplate(
    input_variables=["context"],
    template="""
    <Instructions>
    Here is some context:
    <context>
    {context}
    </context>

    Your task is to generate 1 question that can be answered using the provided context, following these rules:

    <rules>
    1. The question should make sense to humans even when read without the given context.
    2. The question should be fully answered from the given context.
    3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
    4. The answer to the question should not contain any links.
    5. The question should be of moderate difficulty.
    6. The question must be reasonable and must be understood and responded by humans.
    7. Do not use phrases like 'provided context', etc. in the question.
    8. Avoid framing questions using the word "and" that can be decomposed into more than one question.
    9. The question should not contain more than 10 words, make use of abbreviations wherever possible.
    </rules>

    To generate the question, first identify the most important or relevant part of the context. Then frame a question around that part that satisfies all the rules above.

    Output only the generated question with a "?" at the end, no other text or characters.
    </Instructions>
    
    """)

The following is the generated question from our example chunk:

What is the price-performance improvement of AWS Graviton2 chip over x86 processors?

Generate answers

To use the questions for evaluation, you need to generate a reference answer for each of the questions to test against. With the following prompt template, you can generate a reference answer to the created question based on the question and the original source chunk:

# Create a prompt template that takes into consideration the the question and generates an answer
answer_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    <Instructions>
    <Task>
    <role>You are an experienced QA Engineer for building large language model applications.</role>
    <task>It is your task to generate an answer to the following question <question>{question}</question> only based on the <context>{context}</context></task>
    The output should be only the answer generated from the context.

    <rules>
    1. Only use the given context as a source for generating the answer.
    2. Be as precise as possible with answering the question.
    3. Be concise in answering the question and only answer the question at hand rather than adding extra information.
    </rules>

    Only output the generated answer as a sentence. No extra characters.
    </Task>
    </Instructions>
    
    Assistant:""")

The following is the generated answer based on the example chunk:

“The AWS revenue grew 37% year-over-year in 2021.”

Extract relevant context

To make the dataset verifiable, we use the following prompt to extract the relevant sentences from the given context to answer the generated question. Knowing the relevant sentences, you can check whether the question and answer are correct.

# To check whether an answer was correctly formulated by the large language model you get the relevant text passages from the documents used for answering the questions.
source_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Human:
    <Instructions>
    Here is the context:
    <context>
    {context}
    </context>

    Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. You are not allowed to make any changes to the sentences from the context.

    <question>
    {question}
    </question>

    Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.
    </Instructions>
    Assistant:""")

The following is the relevant source sentence extracted using the preceding prompt:

“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS's revenue growth to 37% Y oY in 2021.”

Refine questions

When generating question and answer pairs from the same prompt for the whole dataset, it might appear that the questions are repetitive and similar in form, and therefore don’t mimic real end-user behavior. To prevent this, take the previously created questions and prompt the LLM to modify them according to the rules and guidance established in the prompt. By doing so, a more diverse dataset is synthetically generated. The prompt for generating questions tailored to your specific use case heavily depends on that particular use case. Therefore, your prompt must accurately reflect your end-users by setting appropriate rules or providing relevant examples. The process of refining questions can be repeated as many times as necessary.

# To generate a more versatile testing dataset you alternate the questions to see how your RAG systems performs against differently formulated of questions
question_compress_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
    <Instructions>
    <role>You are an experienced linguistics expert for building testsets for large language model applications.</role>

    <task>It is your task to rewrite the following question in a more indirect and compressed form, following these rules:

    <rules>
    1. Make the question more indirect
    2. Make the question shorter
    3. Use abbreviations if possible
    </rules>

    <question>
    {question}
    </question>

    Your output should only be the rewritten question with a question mark "?" at the end. Do not provide any other explanation or text.
    </task>
    </Instructions>
    
    """)

Users of your application might not always use your solution in the same way, for instance using abbreviations when asking questions. This is why it’s crucial to develop a diverse dataset:

“AWS rev YoY growth in ’21?”

Automate dataset generation

To scale the process of the dataset generation, we iterate over all the chunks in our knowledge base; generate questions, answers, relevant sentences, and refinements for each question; and save them to a pandas data frame to prepare the full dataset.

To provide a clearer understanding of the structure of the dataset, the following table presents a sample row based on the example chunk used throughout this post.

Chunk	Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In the\nfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. […] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.\nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America and\nInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.\nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”
Question	“What was the YoY growth of AWS revenue in 2021?”
Answer	“The AWS revenue grew 37% year-over-year in 2021.”
Source Sentence	“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.”
Evolved Question	“AWS rev YoY growth in ’21?”

On average, the generation of questions with a given context of 1,500–2,000 tokens results in an average processing time of 2.6 seconds for a set of initial question, answer, evolved question, and source sentence discovery using Anthropic Claude 3 Haiku. The generation of 1,000 sets of questions and answers costs approximately $2.80 USD using Anthropic Claude 3 Haiku. The pricing page gives a detailed overview of the cost structure. This results in a more time- and cost-efficient generation of datasets for RAG evaluation compared to the manual generation of these questions sets.

Improve your dataset using critique agents

Although using synthetic data is a good starting point, the next step should be to review and refine the dataset, filtering out or modifying questions that aren’t relevant to your specific use case. One effective approach to accomplish this is by using critique agents.

Critique agents are a technique used in natural language processing (NLP) to evaluate the quality and suitability of questions in a dataset for a particular task or application using a machine learning model. In our case, the critique agents are employed to assess whether the questions in the dataset are valid and appropriate for our RAG system.

The two main metrics evaluated by the critique agents in our example are question relevance and answer groundedness. Question relevance determines how relevant the generated question is for a potential user of our system, and groundedness assesses whether the generated answers are indeed based on the given context.

groundedness_check_prompt_template = PromptTemplate(
    input_variables=["context","question"],
    template="""
    <Instructions>
    You will be given a context and a question related to that context.

    Your task is to provide an evaluation of how well the given question can be answered using only the information provided in the context. Rate this on a scale from 1 to 5, where:

    1 = The question cannot be answered at all based on the given context
    2 = The context provides very little relevant information to answer the question
    3 = The context provides some relevant information to partially answer the question 
    4 = The context provides substantial information to answer most aspects of the question
    5 = The context provides all the information needed to fully and unambiguously answer the question

    First, read through the provided context carefully:

    <context>
    {context}
    </context>

    Then read the question:

    <question>
    {question}
    </question>

    Evaluate how well you think the question can be answered using only the context information. Provide your reasoning first in an <evaluation> section, explaining what relevant or missing information from the context led you to your evaluation score in only one sentence.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>


    </Instructions>
    
    """)

relevance_check_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
    <Instructions>
    You will be given a question related to Amazon Shareholder letters. Your task is to evaluate how useful this question would be for a financial and business analyst working in wallstreet.

    To evaluate the usefulness of the question, consider the following criteria:

    1. Relevance: Is the question directly relevant to your work? Questions that are too broad or unrelated to this domain should receive a lower rating.

    2. Practicality: Does the question address a practical problem or use case that analysts might encounter? Theoretical or overly academic questions may be less useful.

    3. Clarity: Is the question clear and well-defined? Ambiguous or vague questions are less useful.

    4. Depth: Does the question require a substantive answer that demonstrates understanding of financial topics? Surface-level questions may be less useful.

    5. Applicability: Would answering this question provide insights or knowledge that could be applied to real-world company evaluation tasks? Questions with limited applicability should receive a lower rating.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>

    Here is the question:

    {question}
    </Instructions>
    """)

Evaluating the generated questions helps with assessing the quality of a dataset and eventually the quality of the evaluation. The generated question was rated very well:

Groundedness score: 5
“The context provides the exact information needed to answer the question[...]”

Relevance score: 5
“This question is highly relevant and useful for a financial and business analyst working on Wall Street. AWS (Amazon Web Services) is a key business segment for Amazon, and understanding its year-over-year (YoY) revenue growth is crucial for evaluating the company's overall performance and growth trajectory.[...].

Best practices for generating synthetic datasets

Although generating synthetic datasets offers numerous benefits, it’s essential to follow best practices to maintain the quality and representativeness of the generated data:

Combine with real-world data – Although synthetic datasets can mimic real-world scenarios, they might not fully capture the nuances and complexities of actual human interactions or edge cases. Combining synthetic data with real-world data can help address this limitation and create more robust datasets.
Choose the right model – Choose different LLMs for dataset creation than used for RAG generation in order to avoid self-enhancement bias.
Implement robust quality assurance – You can employ multiple quality assurance mechanisms, such as critique agents, human evaluation, and automated checks, to make sure the generated datasets meet the desired quality standards and accurately represent the target use case.
Iterate and refine – You should treat synthetic dataset generation as an iterative process. Continuously refine and improve the process based on feedback and performance metrics, adjusting parameters, prompts, and quality assurance mechanisms as needed.
Domain-specific customization – For highly specialized or niche domains, consider fine-tuning the LLM (such as with PEFT or RLHF) by injecting domain-specific knowledge to improve the quality and accuracy of the generated datasets.

Conclusion

The generation of synthetic datasets is a powerful technique that can significantly enhance the evaluation process of your RAG system, especially in the early stages of development when real-world data is scarce or difficult to obtain. By taking advantage of the capabilities of LLMs, this approach enables the creation of diverse and representative datasets that accurately mimic real human interactions, while also providing the scalability necessary to meet your evaluation needs.

Although this approach offers numerous benefits, it’s essential to acknowledge its limitations. Firstly, the quality of the synthetic dataset heavily relies on the performance and capabilities of the underlying language model, knowledge retrieval system, and quality of prompts used for generation. Being able to understand and adjust the prompts for generation is crucial in this process. Biases and limitations present in these components may be reflected in the generated dataset. Additionally, capturing the full complexity and nuances of real-world interactions can be challenging because synthetic datasets may not account for all edge cases or unexpected scenarios.

Despite these limitations, generating synthetic datasets remains a valuable tool for accelerating the development and evaluation of RAG systems. By streamlining the evaluation process and enabling iterative development cycles, this approach can contribute to the creation of better-performing AI systems.

We encourage developers, researchers, and enthusiasts to explore the techniques mentioned in this post and the accompanying GitHub repository and experiment with generating synthetic datasets for your own RAG applications. Hands-on experience with this technique can provide valuable insights and contribute to the advancement of RAG systems in various domains.

About the Authors

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.

Lukas Wenzel is a Solutions Architect at Amazon Web Services in Hamburg, Germany. He focuses on supporting software companies building SaaS architectures. In addition to that, he engages with AWS customers on building scalable and cost-efficient generative AI features and applications. In his free-time, he enjoys playing basketball and running.

David Boldt is a Solutions Architect at Amazon Web Services. He helps customers build secure and scalable solutions that meet their business needs. He is specialized in machine learning to address industry-wide challenges, using technologies to drive innovation and efficiency across various sectors.

AWS Machine Learning Blog