AWS Partner Network (APN) Blog

A Qualitative Approach to Evaluating Large Language Models for Responsible Gen AI on AWS

By: Randall Hunt, VP, Cloud Strategy and Innovation – Caylent
By: Ali Arabi, Sr. Machine Learning Architect – Caylent
By: Ajit Kumar K.P, Sr. Gen AI Partner Solutions Architect – AWS
By: Aditya Kaseebhatla, Sr. Partner Solutions Architect – AWS
By: Said Nechab, Sr. AI/ML Partner Solutions Architect – AWS

Introduction

The rapid advancement of Generative AI technologies is creating transformative opportunities across industries. Organizations are eager to capitalize on Generative AI’s potential to drive innovation, accelerate digital initiatives, and gain a competitive edge. As these capabilities evolve quickly, there is a growing need for guidance on designing secure, cost-effective, and well-governed Generative AI solutions, as well as operationalizing and scaling these technologies.

Evaluating Large Language Models (LLMs) for the responsible deployment of Generative AI applications is a critical aspect of this journey. Partners have a unique opportunity to help customers navigate the complexity, mitigate risks, and unlock the full potential of Generative AI through tailored solutions and services.

By collaborating with AWS and leveraging its comprehensive AI/ML services, partners can provide enterprise-grade Generative AI solutions. This allows customers to innovate with Generative AI while benefiting from AWS’s robust security, governance, and operational best practices tailored to take Generative AI workloads to production.

Customer Challenges

With a steady stream of new large language models being released, practitioners must swiftly assess whether the latest model aligns with their particular use cases and requirements. Simultaneously, they need to closely monitor the performance of the models currently deployed and determine if and when an upgrade is warranted.

At this point, it is widely recognized that a single model is unlikely to be the optimal solution across all use cases. Therefore, it is crucial to evaluate multiple models along various dimensions, such as accuracy, robustness, toxicity, and bias, in order to identify the most suitable model for specific applications, be it text generation, coding, chatbots, summarization, or others.

The Amazon Bedrock Model Evaluation capability serves as a valuable resource for evaluating foundation models. This feature enables users to run model comparison jobs, empowering them to select the model best suited for their downstream generative AI applications. The model evaluation tool supports common use cases for large language models (LLMs), including text generation, text classification, question answering, and text summarization.

While the Amazon Bedrock Model Evaluation capability primarily caters to the pre-trained models available on Amazon Bedrock, Caylent has developed a complementary human-in-the-loop model evaluation framework called “bedrock-benchmark”. This framework can be used to qualitatively assess the consistency of Amazon Bedrock large language models (LLMs) on specific tasks. This blog post outlines the high-level architecture of the solution, along with best practices around prompt management, cost optimization, responsible use policies, and architectural principles for building an agile and highly extensible solution for LLM evaluation.

Solution Overview

Automated benchmarking of LLM performance faces significant challenges because many tasks are open-ended and require subjective human judgment of response quality. To address this limitation, the bedrock-benchmark framework follows a human-in-the-loop (HITL) approach, where human evaluators assess the quality and consistency of responses generated by LLMs across various tasks over time.

Several public LLM benchmarking datasets exist for the different task types that LLMs are designed for. The bedrock-benchmark solution focuses on evaluating LLM performance across tasks such as coding, chatbot interactions, summarization, and instruction following.

Some of the popular benchmark datasets and metrics commonly used for these tasks include the HumanEval dataset (for code generation), MT-Bench (for instruction-following ability), MBPP (for Python code generation), and Chatbot Arena (for chatbot assistants).

At a high-level, the bedrock-benchmark solution consists of the following key components:

  1. Model Repository — The model repository currently includes various large language models (LLMs) available on Amazon Bedrock. It can be extended to include foundation models (FMs) on Amazon SageMaker as new models become available.
  2. Prompt Catalog — An Amazon DynamoDB table is used as a prompt catalog to store the prompts that invoke the Amazon Bedrock LLMs, allowing prompts to be versioned and reused repeatably. A second Amazon DynamoDB table serves as a repository for the LLM responses that are later assessed during the HITL evaluation process.
  3. Workflow Orchestration — AWS Step Functions orchestrates the workflow, which includes invoking the Amazon Bedrock LLMs, processing their responses, and recording the results. Amazon EventBridge triggers the workflow on a schedule or on demand.
  4. UI for human feedback — A Streamlit-based application facilitates human evaluation and feedback. It presents the human evaluator with cases where model response quality has drifted and captures human feedback for future analysis.

Solution Architecture and Implementation

Figure 1 - Architecture of Caylent’s LLM Benchmarking Solution

The Model Repository currently includes the following models available on Amazon Bedrock. New models from Amazon Bedrock or Amazon SageMaker can also be onboarded as needed.

  • Amazon: Titan Express, Titan Large
  • Anthropic: Claude V2.1, Claude Sonnet, Claude Haiku, Claude Opus
  • Cohere: Command-R, Command-R+
  • Meta: Llama3 8B Instruct, Llama3 70B Instruct
  • AI21: Jurassic Ultra, Jurassic Mid
  • Mistral: Mistral-7b-instruct, Mixtral-8x7b-instruct

The prompts used for benchmarking the LLMs are stored in Amazon DynamoDB, as it provides a scalable and reliable option to catalog and version prompts for repeatability. Customers have the flexibility to bring their own benchmarking datasets and add them to the prompt catalog (step 1 in figure 1). For a simple illustration, the solution currently includes 20 prompts across 5 different categories: knowledge, code, instruct, creativity, and reflection. Figure 2 shows a snapshot of the Amazon DynamoDB table created to store the curated prompts belonging to these categories.

Figure 2 - Prompt catalog in the Amazon DynamoDB table
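
For illustration, the minimal sketch below shows how a prompt could be registered in such a catalog using boto3. The table name, key schema, and attribute names are assumptions made for this example and may differ from the actual bedrock-benchmark implementation.

```python
import boto3

# Assumed table and attribute names for illustration; the actual
# bedrock-benchmark schema may differ.
dynamodb = boto3.resource("dynamodb")
prompt_table = dynamodb.Table("prompt-catalog")


def register_prompt(category: str, prompt_id: str, prompt_text: str, version: int = 1) -> None:
    """Store a versioned benchmark prompt in the prompt catalog."""
    prompt_table.put_item(
        Item={
            "prompt_id": prompt_id,   # partition key (assumed)
            "version": version,       # sort key (assumed), enables repeatable versioning
            "category": category,     # e.g. knowledge, code, instruct, creativity, reflection
            "prompt_text": prompt_text,
        }
    )


# Example: add a coding prompt to the catalog
register_prompt(
    category="code",
    prompt_id="code-001",
    prompt_text="Write a Python function that returns the n-th Fibonacci number.",
)
```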

AWS Step Functions orchestrates the entire workflow, while Amazon EventBridge triggers it (steps 2 and 3 in figure 1). By default, Amazon EventBridge is scheduled to trigger the workflow every day, but a machine learning engineer can change the schedule as needed. Once the workflow is initiated, AWS Step Functions runs all six branches of AWS Lambda functions in parallel. Each Lambda function is dedicated to a specific model provider and contains the logic to invoke the models, process their responses, and record the results (steps 4, 5 and 6 in figure 1).

Figure 3 - AWS Step Functions for workflow orchestration
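
As a simplified sketch of what one provider branch might do, the Lambda handler below invokes a Bedrock model for a batch of prompts and records each response in DynamoDB. It uses the provider-agnostic Bedrock Converse API for brevity; the table names, event shape, and inference settings are assumptions for this example, and the actual framework may use provider-specific request payloads instead.

```python
import time
import boto3

# Assumed resource names for illustration; the framework's actual table names,
# event structure, and model configuration may differ.
bedrock = boto3.client("bedrock-runtime")
responses_table = boto3.resource("dynamodb").Table("llm-responses")


def handler(event, context):
    """Invoke one Bedrock model for each prompt in the event and record the results."""
    model_id = event["model_id"]      # e.g. "anthropic.claude-3-sonnet-20240229-v1:0"
    for prompt in event["prompts"]:   # prompts read from the prompt catalog upstream
        start = time.time()
        result = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt["prompt_text"]}]}],
            inferenceConfig={"maxTokens": 512, "temperature": 0.0},
        )
        latency_ms = int((time.time() - start) * 1000)
        responses_table.put_item(
            Item={
                "model_id": model_id,
                "prompt_id": prompt["prompt_id"],
                "invocation_date": time.strftime("%Y-%m-%d"),
                "response_text": result["output"]["message"]["content"][0]["text"],
                "input_tokens": result["usage"]["inputTokens"],
                "output_tokens": result["usage"]["outputTokens"],
                "latency_ms": latency_ms,
            }
        )
    return {"status": "ok", "model_id": model_id}
```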

Once the workflow has successfully executed, there is a complete record of all the LLMs that were invoked, with their respective input prompts and output responses stored in the Amazon DynamoDB table (step 6 in figure 1). Other useful metrics, such as input and output token counts, model configuration, invocation date, and latency, are recorded as well. If the latest response from any LLM in any category (coding, chatbot, summarization, or instruction following) differs from the baseline response, it indicates a quality drift that must be inspected by a human evaluator.
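
A minimal sketch of such a drift check is shown below, assuming each record carries the baseline and latest responses; the actual framework may apply a more tolerant comparison than the normalized exact match used here.

```python
def detect_drift(baseline_response: str, latest_response: str) -> bool:
    """Return True when the latest response differs from the recorded baseline."""
    def normalize(text: str) -> str:
        # Collapse whitespace and ignore casing so trivial formatting
        # differences are not flagged as drift.
        return " ".join(text.split()).lower()

    return normalize(baseline_response) != normalize(latest_response)


def cases_for_review(records: list[dict]) -> list[dict]:
    """Collect the records that should be surfaced to the human evaluator."""
    return [r for r in records if detect_drift(r["baseline_response"], r["latest_response"])]
```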

UI for human evaluation and feedback

To facilitate human evaluation and feedback, a Streamlit-based application has been built (step 7 in figure 1). When this application is executed, it displays cases where the most recent response from any LLM in any category (such as coding, chatbot, summarization, or instruction following) differs from its baseline response. In other words, it presents all the instances where there is a drift in model response quality. The human evaluator then scores the latest response: 0 for “INCORRECT/WORSE”, 1 for “SAME”, and 2 for “BETTER”. This score is recorded for future analysis (step 8 in figure 1).

Figure 4 shows a snapshot of the Streamlit application deployed in Amazon SageMaker Studio.

Figure 4 - Streamlit application for HITL evaluation
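
A stripped-down sketch of this scoring interaction is shown below. The table name, attribute names, and the way drifted cases are loaded are assumptions for this example; the deployed application may differ.

```python
import boto3
import streamlit as st

# Assumed table and attribute names; the deployed application may differ.
feedback_table = boto3.resource("dynamodb").Table("hitl-feedback")

st.title("bedrock-benchmark: human evaluation")

# In the real application the drifted cases would be read from the responses
# table; here they are assumed to be pre-loaded into session state.
drifted_cases = st.session_state.get("drifted_cases", [])

for case in drifted_cases:
    st.subheader(f"{case['model_id']} / {case['category']}")
    st.text_area("Baseline response", case["baseline_response"], disabled=True,
                 key=f"base-{case['model_id']}-{case['prompt_id']}")
    st.text_area("Latest response", case["latest_response"], disabled=True,
                 key=f"latest-{case['model_id']}-{case['prompt_id']}")
    score = st.radio(
        "Score the latest response",
        options=[0, 1, 2],
        format_func=lambda s: {0: "INCORRECT/WORSE", 1: "SAME", 2: "BETTER"}[s],
        key=f"score-{case['model_id']}-{case['prompt_id']}",
    )
    if st.button("Submit", key=f"submit-{case['model_id']}-{case['prompt_id']}"):
        feedback_table.put_item(
            Item={
                "model_id": case["model_id"],
                "prompt_id": case["prompt_id"],
                "score": score,
            }
        )
        st.success("Feedback recorded")
```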

The user feedback collected from repeated runs of this workflow for each LLM can be visualized as an informative radar (spider) chart, as shown in Figure 5. This chart can be used to quickly demonstrate the proficiency and consistency in quality of LLMs across different tasks over time.

Figure 5 - Qualitative performance visualization
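
As a rough illustration of how such a chart could be produced, the snippet below plots hypothetical average HITL scores per prompt category for a single model using matplotlib; the category names match the sample catalog, but the scores are invented for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical average human-feedback scores (0-2 scale) per category for one model.
categories = ["knowledge", "code", "instruct", "creativity", "reflection"]
scores = [1.8, 1.2, 1.5, 1.7, 1.4]

# Close the polygon by repeating the first point at the end.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 2)
ax.set_title("Average HITL score per task category")
plt.savefig("radar_chart.png", bbox_inches="tight")
```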

Caylent plans to further extend this framework to include cost as a metric so that the price/performance of models can also be compared to evaluate return on investment. As LLMs continue to evolve rapidly, benchmarking price/performance ensures the most appropriate model is used for each use case.

Customer reference

A fintech company in North America that provides risk management solutions for small and medium businesses was exploring integrating Generative AI features into its product portfolio. Caylent provided an Innovation Engine, a team of architects and engineers with machine learning, big data, application, and software expertise, to prioritize experiments, prove value, and implement Generative AI solutions in production.

Through this framework, Caylent built proofs of value for two of the company's product lines and deployed them to production, enabling the company to offer new AI-enabled features to its end users.

Conclusion

This post presented a solution created by Caylent that allows customers to run a human-in-the-loop LLM evaluation and benchmarking workflow for different tasks, built with AWS services and a Streamlit UI application.

Caylent offers a full suite of Generative AI offerings on AWS. Starting with Generative AI Strategy Catalyst, you can start the ideation process to guide you through the art of the possible for your business. Using these new ideas, you can implement the Generative AI Knowledge Base to build a quick, out-of-the-box solution integrated with your company’s data to enable powerful search capabilities using natural language queries. Caylent’s Generative AI Proof of Value Catalyst can further help you build an AI roadmap for your company and demonstrate how Generative AI will play a part.

Caylent offers solutions to meet customers wherever they are in their Generative AI adoption journey. Caylent’s AI Innovation Engine, an embedded, agile, multidisciplinary AI team, helps to fast track scaled development of customers’ Generative AI initiatives by taking a portfolio approach, leading a business value exploration to determine the optimal path from idea to impact, rapidly prototyping, and releasing to production.

Additionally, Caylent’s proprietary Generative AI framework MeteorAI accelerates front-end development through templatized role-based access control and fine-grained access control to securely integrate customers’ existing data. It expedites the production release of Generative AI solutions, redefining how organizations build bespoke internal and consumer-facing Generative AI applications.

Call To Action

If you’re looking to unlock the potential of Generative AI for your organization, Caylent, an AWS Premier Tier Services Partner, can help. Caylent has deep expertise in operationalizing Generative AI on AWS securely and responsibly, and can help you navigate the rapid evolution of models and best practices, including taking advantage of AWS’s comprehensive AI/ML services and operational excellence to drive innovation. Reach out to Caylent to get started on executing a Generative AI strategy tailored to your business needs.

Caylent – AWS Partner Spotlight

Caylent is proudly an all-in AWS Premier Consulting Partner, with almost a decade of experience. Our clients rely on us to bring their business goals and product visions to life. That’s why we partner with AWS, the ready innovator and market leader in public cloud platforms. From Delivery to Operations to Sales, our teams align with AWS at every level and welcome AWS’ continued commitment to making the art of the possible, doable.

Contact Caylent | Partner Overview | AWS Marketplace