Amazon Bedrock Evaluations

Evaluate foundation models, including custom and imported models, to find the ones that fit your needs. You can also evaluate your retrieval or end-to-end RAG workflow in Amazon Bedrock Knowledge Bases.

Overview

Amazon Bedrock provides evaluation tools to help you accelerate the adoption of generative AI applications. Evaluate, compare, and select the foundation model for your use case with Model Evaluation. Prepare RAG applications built on Amazon Bedrock Knowledge Bases or your own custom RAG system for production by evaluating their retrieve or retrieve-and-generate functions.

Evaluation types

Use an LLM as a judge to evaluate model outputs against your custom prompt datasets with metrics such as correctness, completeness, and harmfulness; a programmatic sketch follows this list.

Evaluate model outputs with traditional natural language processing algorithms and metrics such as BERTScore, F1, and other exact-match techniques, using built-in prompt datasets or bringing your own.

Evaluate model outputs with your own workforce, or have AWS manage the evaluation for you, scoring responses to your custom prompt datasets against built-in or custom metrics.

Evaluate the retrieval quality of your custom RAG system or Amazon Bedrock Knowledge Bases with your prompts and metrics such as context relevance and context coverage.

Evaluate the generated content of your end-to-end RAG workflow from either your custom RAG pipeline or Amazon Bedrock Knowledge Bases. Use your own prompts and metrics such as faithfulness (hallucination detection), correctness, and completeness.
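
The LLM-as-a-judge type above can also be started programmatically. Below is a minimal boto3 sketch; the job name, IAM role, S3 paths, task type, and model identifiers are illustrative assumptions, so check the CreateEvaluationJob API reference for the exact fields and supported values.

    import boto3

    bedrock = boto3.client("bedrock", region_name="us-east-1")

    # Minimal sketch of an LLM-as-a-judge model evaluation job. All ARNs,
    # bucket paths, and model IDs are placeholders; substitute your own.
    # Each line of prompts.jsonl is a JSON object, typically of the form
    # {"prompt": "...", "referenceResponse": "..."}.
    response = bedrock.create_evaluation_job(
        jobName="llmaaj-demo-job",
        roleArn="arn:aws:iam::111122223333:role/EvalRole",
        applicationType="ModelEvaluation",
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",  # assumed value; see the API reference
                    "dataset": {
                        "name": "CustomPrompts",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/prompts.jsonl"},
                    },
                    # Judge metrics named in this section.
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Harmfulness",
                    ],
                }],
                # The model that scores each candidate response.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [
                        {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                    ]
                },
            }
        },
        # The candidate model whose outputs are judged.
        inferenceConfig={
            "models": [{"bedrockModel": {"modelIdentifier": "amazon.titan-text-express-v1"}}]
        },
        outputDataConfig={"s3Uri": "s3://my-bucket/eval-results/"},
    )
    print(response["jobArn"])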

Evaluate your end-to-end RAG workflow

Use retrieve-and-generate evaluations to evaluate the end-to-end retrieval-augmented generation (RAG) capability of your application. Ensure that the generated content is correct and complete, limits hallucinations, and adheres to responsible AI principles. You can evaluate the performance of an Amazon Bedrock knowledge base or bring your own inference responses from your custom RAG system: select an LLM to act as the judge, upload your dataset, and select the metrics most important for your evaluation.
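
As a rough sketch of what that looks like through the API, the boto3 call below starts a retrieve-and-generate evaluation against a knowledge base. The knowledge base ID, model ARNs, and S3 paths are placeholder assumptions.

    import boto3

    bedrock = boto3.client("bedrock", region_name="us-east-1")

    # Minimal sketch: judge the end-to-end output of a knowledge base.
    response = bedrock.create_evaluation_job(
        jobName="rag-e2e-demo",
        roleArn="arn:aws:iam::111122223333:role/EvalRole",
        applicationType="RagEvaluation",
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",  # assumed value; see the API reference
                    "dataset": {
                        "name": "RagPrompts",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/rag-prompts.jsonl"},
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Faithfulness",  # hallucination detection
                    ],
                }],
                # The judge model.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [
                        {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                    ]
                },
            }
        },
        # Point the job at the knowledge base whose answers should be judged.
        inferenceConfig={
            "ragConfigs": [{
                "knowledgeBaseConfig": {
                    "retrieveAndGenerateConfig": {
                        "type": "KNOWLEDGE_BASE",
                        "knowledgeBaseConfiguration": {
                            "knowledgeBaseId": "KB123EXAMPLE",
                            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                                        "anthropic.claude-3-haiku-20240307-v1:0",
                        },
                    }
                }
            }]
        },
        outputDataConfig={"s3Uri": "s3://my-bucket/rag-eval-results/"},
    )

For a custom RAG system, the dataset instead carries your precomputed responses and the ragConfigs entry references them; the exact field names for that variant are in the CreateEvaluationJob reference.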

Ensure complete and relevant retrieval from your RAG system

Use RAG retrieval evaluations to evaluate the storage and retrieval settings of your Amazon Bedrock knowledge base or custom RAG system. Ensure that the retrieved content is relevant and covers the entire user query. Select an LLM to act as the judge, choose a knowledge base to evaluate (or include your custom RAG system's retrievals in your prompt dataset), and select your metrics.
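
Only the metric list and the inference configuration differ from the end-to-end job sketched above. The fragment below, with an assumed knowledge base ID and an illustrative numberOfResults, shows the retrieve-only variant; pass these as the metricNames and ragConfigs values in the same create_evaluation_job call.

    # Retrieval-focused judge metrics named in this section.
    retrieval_metrics = ["Builtin.ContextRelevance", "Builtin.ContextCoverage"]

    # Retrieve-only inference config: no generator model is involved.
    retrieve_only_rag_config = {
        "knowledgeBaseConfig": {
            "retrieveConfig": {
                "knowledgeBaseId": "KB123EXAMPLE",  # placeholder ID
                "knowledgeBaseRetrievalConfiguration": {
                    "vectorSearchConfiguration": {"numberOfResults": 5}
                },
            }
        }
    }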

Evaluate FMs to select the best one for your use case

Amazon Bedrock Model Evaluation lets you use automatic and human evaluations to select FMs for a specific use case. Automatic (programmatic) model evaluation uses curated and custom datasets and provides predefined metrics, including accuracy, robustness, and toxicity.

For subjective metrics, you can use Amazon Bedrock to set up a human evaluation workflow in a few quick steps. With human evaluations, you can bring your own datasets and define custom metrics such as relevance, style, and alignment to brand voice. Human evaluation workflows can use your own employees as reviewers, or you can engage a team managed by AWS, in which case AWS hires skilled evaluators and runs the complete workflow on your behalf.

You can also use an LLM-as-a-judge to provide high-quality evaluations on your dataset with metrics such as correctness, completeness, and faithfulness (hallucination detection), as well as responsible AI metrics such as answer refusal and harmfulness. You can evaluate Bedrock models, or any model anywhere, by bringing your own inference responses in your input prompt dataset.
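
The bring-your-own-inference-responses path works by shipping precomputed answers inside the prompt dataset. A minimal sketch follows, assuming the hypothetical identifier my-external-model: each dataset line pairs a prompt with the responses your system already produced, and the job configuration references the same identifier instead of a Bedrock model.

    import json

    # One JSONL record with a precomputed response from a non-Bedrock model.
    # The identifier string is arbitrary; it just has to match the job config.
    record = {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",
        "modelResponses": [
            {"response": "The capital of France is Paris.",
             "modelIdentifier": "my-external-model"}
        ],
    }
    print(json.dumps(record))  # write one object per line to the dataset file

    # In create_evaluation_job, reference the identifier rather than a model:
    inference_config = {
        "models": [
            {"precomputedInferenceSource": {"inferenceSourceIdentifier": "my-external-model"}}
        ]
    }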

Compare results across multiple evaluation jobs to make decisions faster

Use the compare feature in Evaluations to see how changes to your prompts, the models under evaluation, your custom RAG systems, or your Amazon Bedrock knowledge bases affect results across evaluation jobs.
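
The compare view is a console feature; programmatically, you can gather the same inputs by listing completed jobs and reading each job's results from its S3 output location. A minimal boto3 sketch, with illustrative filter values:

    import boto3

    bedrock = boto3.client("bedrock", region_name="us-east-1")

    # List finished evaluation jobs and print where each wrote its results,
    # so the metric summaries in S3 can be compared side by side.
    jobs = bedrock.list_evaluation_jobs(statusEquals="Completed", maxResults=20)
    for summary in jobs["jobSummaries"]:
        detail = bedrock.get_evaluation_job(jobIdentifier=summary["jobArn"])
        print(detail["jobName"], detail["outputDataConfig"]["s3Uri"])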
