Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching (preview)

Today, Amazon Bedrock introduced two capabilities in preview that help reduce the cost and latency of generative AI applications:

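Amazon Bedrock Intelligent Prompt Routing: When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost. For example, with Anthropic's Claude model family, Amazon Bedrock can intelligently route requests between Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt. Similarly, Amazon Bedrock can route requests between Meta Llama 3.1 70B and 8B. The prompt router predicts which model will provide the best performance for each request while optimizing response quality and cost. This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy.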

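Amazon Bedrock Prompt Caching: You can now cache frequently used context in prompts across multiple model invocations. This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document, or coding assistants that need to maintain context about code files. The cached context remains available for up to 5 minutes after each access. Prompt caching in Amazon Bedrock can reduce costs by up to 90 percent and latency by up to 85 percent for supported models.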

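These features make it easier to reduce latency and balance performance with cost efficiency. Let's look at how you can use them in your applications.

Using Amazon Bedrock Intelligent Prompt Routing
Amazon Bedrock Intelligent Prompt Routing uses advanced prompt matching and model understanding techniques to predict the performance of each model for every request, optimizing for response quality and cost. During the preview, you can use the default prompt routers for the Anthropic Claude and Meta Llama model families.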

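Intelligent Prompt Routing can be accessed through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs. In the Amazon Bedrock console, I choose Prompt routers in the Foundation models section of the navigation pane.

I choose the Anthropic Prompt Router default router to get more information.

From the configuration of the prompt router, I see that it routes requests between Claude 3.5 Sonnet and Claude 3 Haiku using cross-Region inference profiles. The routing criteria define the quality difference between the responses of the largest model and the smallest model for each prompt, as predicted by the router's internal model at runtime. The fallback model, used when none of the chosen models meets the desired performance criteria, is Anthropic's Claude 3.5 Sonnet.

I choose Open in Playground to chat using the prompt router and enter this prompt: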

Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?

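The result is provided quickly. I choose the new Router metrics icon on the right to see which model the prompt router selected. In this case, because the question is rather complex, Anthropic's Claude 3.5 Sonnet was used.

Now I ask the same prompt router a straightforward question: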

Describe the purpose of a 'hello world' program in one line.

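This time, the prompt router selected Anthropic's Claude 3 Haiku.

I select the Meta Prompt Router to check its configuration. It uses the cross-Region inference profiles for Llama 3.1 70B and 8B, with the 70B model as the fallback.

Prompt routers are integrated with other Amazon Bedrock capabilities, such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, and can be used when performing evaluations. For example, here I create a model evaluation to help me compare, for my use case, a prompt router with another model or prompt router.

To use a prompt router in an application, I need to set the prompt router Amazon Resource Name (ARN) as the model ID in the Amazon Bedrock API. Let's see how this works with the AWS CLI and an AWS SDK.

Using Amazon Bedrock Intelligent Prompt Routing with the AWS CLI
The Amazon Bedrock API has been extended to handle prompt routers. For example, I can list the existing prompt routers in an AWS Region using ListPromptRouters: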

aws bedrock list-prompt-routers

The output is a summary of the existing prompt routers, similar to what I saw in the console.

Here's the full output of the previous command:

{
    "promptRouterSummaries": [
        {
            "promptRouterName": "Anthropic Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.26
            },
            "description": "Routes requests among models in the Claude family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        },
        {
            "promptRouterName": "Meta Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.0
            },
            "description": "Routes requests among models in the LLaMA family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        }
    ]
}
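The same listing can be condensed in Python. The following is a minimal sketch, not an official helper: `summarize_routers` and `fetch_router_summaries` are hypothetical names, the sample dictionary is a trimmed copy of the output above, and the `list_prompt_routers` call assumes a boto3 version recent enough to include the prompt router operations.

```python
def summarize_routers(response):
    """Map each router name to its candidate and fallback model IDs."""
    summary = {}
    for router in response.get("promptRouterSummaries", []):
        # The model ID is the last path segment of the inference profile ARN.
        models = [m["modelArn"].split("/")[-1] for m in router["models"]]
        fallback = router["fallbackModel"]["modelArn"].split("/")[-1]
        summary[router["promptRouterName"]] = {
            "models": models,
            "fallback": fallback,
        }
    return summary


def fetch_router_summaries(region="us-east-1"):
    """Call ListPromptRouters; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so summarize_routers works offline

    bedrock = boto3.client("bedrock", region_name=region)
    return summarize_routers(bedrock.list_prompt_routers())


# Trimmed copy of the CLI output shown above (assumption: same response shape).
PREFIX = "arn:aws:bedrock:us-east-1:123412341234:inference-profile/"
sample = {
    "promptRouterSummaries": [
        {
            "promptRouterName": "Meta Prompt Router",
            "models": [
                {"modelArn": PREFIX + "us.meta.llama3-1-8b-instruct-v1:0"},
                {"modelArn": PREFIX + "us.meta.llama3-1-70b-instruct-v1:0"},
            ],
            "fallbackModel": {
                "modelArn": PREFIX + "us.meta.llama3-1-70b-instruct-v1:0"
            },
        }
    ]
}

print(summarize_routers(sample))
```

Keeping the parsing separate from the API call makes it possible to test the summary logic against saved output without credentials.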

I can get information about a specific prompt router using GetPromptRouter with a prompt router ARN. For example, for the Meta Llama model family:

aws bedrock get-prompt-router --prompt-router-arn arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1

{
    "promptRouterName": "Meta Prompt Router",
    "routingCriteria": {
        "responseQualityDifference": 0.0
    },
    "description": "Routes requests among models in the LLaMA family",
    "createdAt": "2024-11-20T00:00:00+00:00",
    "updatedAt": "2024-11-20T00:00:00+00:00",
    "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
    "models": [
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
        },
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
        }
    ],
    "fallbackModel": {
        "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
    },
    "status": "AVAILABLE",
    "type": "default"
}

To use a prompt router with Amazon Bedrock, I set the prompt router ARN as the model ID when making API calls. For example, here I use the Anthropic Prompt Router with the AWS CLI and the Amazon Bedrock Converse API:

aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1 \
    --messages '[{ "role": "user", "content": [ { "text": "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?" } ] }]'

In the output, invocations that use a prompt router include a new trace section that tells which model was actually used. In this case, it's Anthropic's Claude 3.5 Sonnet:

{
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "text": "To solve this problem, let's think it through step-by-step:\n\n1) First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2) Now, we need to consider who Alice's brothers' sisters are:\n   - Alice herself is a sister to all her brothers\n   - All of Alice's sisters are also sisters to Alice's brothers\n\n3) So, the total number of sisters that Alice's brothers have is:\n   - The number of Alice's sisters (M)\n   - Plus Alice herself (+1)\n\n4) Therefore, the answer can be expressed as: M + 1\n\nThus, Alice's brothers have M + 1 sisters."
                }
            ]
        }
    },
    . . .
    "trace": {
        "promptRouter": {
            "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
        }
    }
}
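To act on that trace in code, a small helper can pull the chosen model out of a Converse response. This is a sketch under the assumption that the response has the shape shown above; `invoked_model` and `routing_tally` are hypothetical names, and the sample response is abbreviated.

```python
from collections import Counter


def invoked_model(response):
    """Return the model ID the router chose, or None without a router trace."""
    arn = response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")
    if arn is None:
        return None
    # Keep only the model ID after the "inference-profile/" prefix.
    return arn.split("/")[-1]


def routing_tally(responses):
    """Count how many responses each model served."""
    return Counter(invoked_model(r) for r in responses)


# Abbreviated copy of the response above.
sample_response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "..."}]}},
    "trace": {
        "promptRouter": {
            "invokedModelId": (
                "arn:aws:bedrock:us-east-1:123412341234:inference-profile/"
                "us.anthropic.claude-3-5-sonnet-20240620-v1:0"
            )
        }
    },
}

print(invoked_model(sample_response))
print(routing_tally([sample_response, sample_response]))
```

A tally like this, collected over real traffic, gives a rough view of how often the router picks the smaller model.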

Using Amazon Bedrock Intelligent Prompt Routing with an AWS SDK
Using an AWS SDK with a prompt router is similar to the previous command line experience. When invoking a model, I set the model ID to the prompt router ARN. For example, in this Python code I use the Meta Llama router with the ConverseStream API:

import json
import boto3

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1"

user_message = "Describe the purpose of a 'hello world' program in one line."
messages = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

streaming_response = bedrock_runtime.converse_stream(
    modelId=MODEL_ID,
    messages=messages,
)

for chunk in streaming_response["stream"]:
    if "contentBlockDelta" in chunk:
        text = chunk["contentBlockDelta"]["delta"]["text"]
        print(text, end="")
    if "messageStop" in chunk:
        print()
    if "metadata" in chunk:
        if "trace" in chunk["metadata"]:
            print(json.dumps(chunk['metadata']['trace'], indent=2))

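This script prints the response text and the content of the trace in the response metadata. For this uncomplicated request, the prompt router selected the faster and more affordable model:

A "Hello World" program is a simple, introductory program that serves as a basic example to demonstrate the fundamental syntax and functionality of a programming language, typically used to verify that a development environment is set up correctly.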
{
  "promptRouter": {
    "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
  }
}

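Using prompt caching with an AWS SDK
You can use prompt caching with the Amazon Bedrock Converse API. When you tag content for caching and send it to the model for the first time, the model processes the input and saves the intermediate results in a cache. For subsequent requests containing the same content, the model loads the preprocessed results from the cache, significantly reducing both cost and latency.

You can implement prompt caching in your applications in a few steps: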

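1. Identify the portions of your prompts that are frequently reused.
2. Tag these sections for caching in the list of messages using the new cachePoint block.
3. Monitor cache usage and latency improvements in the usage section of the response metadata.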
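The tagging step can be sketched as a small message builder. This is an illustration only: `build_message` is a hypothetical helper, and the layout (reusable blocks, then the cachePoint block, then the content that changes per request) is one reasonable arrangement rather than the only one.

```python
def build_message(cached_blocks, fresh_blocks):
    """Build a Converse user message with a cache point after the reusable part."""
    content = list(cached_blocks)  # copy so the caller's list is not modified
    # Everything before this block is a candidate for caching.
    content.append({"cachePoint": {"type": "default"}})
    # Content after the cache point changes per request and is not cached.
    content.extend(fresh_blocks)
    return {"role": "user", "content": content}


message = build_message(
    cached_blocks=[{"text": "<long reference text reused across requests>"}],
    fresh_blocks=[{"text": "What does section 2 recommend?"}],
)
print(message)
```

The resulting dictionary can be placed in the messages list passed to the Converse API.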

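Here's an example of implementing prompt caching when working with documents.

First, I download three decision guides in PDF format from the AWS website. These guides help choose the AWS services that fit your use case.

Then, I use a Python script to ask three questions about the documents. In the code, I create a converse() function to handle the conversation with the model. The first time I call the function, I include a list of documents and a flag to add a cachePoint block.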

import json

import boto3

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
AWS_REGION = "us-west-2"

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name=AWS_REGION,
)

DOCS = [
    "bedrock-or-sagemaker.pdf",
    "generative-ai-on-aws-how-to-choose.pdf",
    "machine-learning-on-aws-how-to-choose.pdf",
]

messages = []


def converse(new_message, docs=None, cache=False):
    # Start a new user turn unless the last message is already from the user.
    if len(messages) == 0 or messages[-1]["role"] != "user":
        messages.append({"role": "user", "content": []})

    # Attach each document as a document block (docs defaults to no documents).
    for doc in docs or []:
        print(f"Adding document: {doc}")
        name, doc_format = doc.rsplit('.', maxsplit=1)
        with open(doc, "rb") as f:
            doc_bytes = f.read()
        messages[-1]["content"].append({
            "document": {
                "name": name,
                "format": doc_format,
                "source": {"bytes": doc_bytes},
            }
        })

    messages[-1]["content"].append({"text": new_message})

    if cache:
        messages[-1]["content"].append({"cachePoint": {"type": "default"}})

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=messages,
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response text:")
    print(response_text)

    print("Usage:")
    print(json.dumps(response["usage"], indent=2))

    messages.append(output_message)


converse("Compare AWS Trainium and AWS Inferentia in 20 words or less.", docs=DOCS, cache=True)
converse("Compare Amazon Textract and Amazon Transcribe in 20 words or less.")
converse("Compare Amazon Q Business and Amazon Q Developer in 20 words or less.")

For each invocation, the script prints the response and the usage counters.

Adding document: bedrock-or-sagemaker.pdf
Adding document: generative-ai-on-aws-how-to-choose.pdf
Adding document: machine-learning-on-aws-how-to-choose.pdf
Response text:
AWS Trainium is optimized for machine learning training, while AWS Inferentia is designed for low-cost, high-performance machine learning inference.
Usage:
{
  "inputTokens": 4,
  "outputTokens": 34,
  "totalTokens": 29879,
  "cacheReadInputTokenCount": 0,
  "cacheWriteInputTokenCount": 29841
}
Response text:
Amazon Textract extracts text and data from documents, while Amazon Transcribe converts speech to text from audio or video files.
Usage:
{
  "inputTokens": 59,
  "outputTokens": 30,
  "totalTokens": 29930,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}
Response text:
Amazon Q Business answers questions using enterprise data, while Amazon Q Developer assists with building and operating AWS applications and services.
Usage:
{
  "inputTokens": 108,
  "outputTokens": 26,
  "totalTokens": 29975,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}

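The usage section of the response contains two new counters: cacheReadInputTokenCount and cacheWriteInputTokenCount. The total number of tokens for an invocation is the sum of the input and output tokens plus the tokens read from and written to the cache.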
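That sum can be checked against the three usage blocks printed by the script. The snippet below is a small sanity check, with `expected_total` as a hypothetical helper name; the numbers are copied from the output above.

```python
def expected_total(usage):
    """Sum input, output, cache-read, and cache-write token counts."""
    return (
        usage["inputTokens"]
        + usage["outputTokens"]
        + usage.get("cacheReadInputTokenCount", 0)
        + usage.get("cacheWriteInputTokenCount", 0)
    )


# Usage counters copied from the three invocations above.
usages = [
    {"inputTokens": 4, "outputTokens": 34, "totalTokens": 29879,
     "cacheReadInputTokenCount": 0, "cacheWriteInputTokenCount": 29841},
    {"inputTokens": 59, "outputTokens": 30, "totalTokens": 29930,
     "cacheReadInputTokenCount": 29841, "cacheWriteInputTokenCount": 0},
    {"inputTokens": 108, "outputTokens": 26, "totalTokens": 29975,
     "cacheReadInputTokenCount": 29841, "cacheWriteInputTokenCount": 0},
]

print(all(expected_total(u) == u["totalTokens"] for u in usages))  # prints True
```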

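Each invocation processes a list of messages. The messages in the first invocation contain the documents, the first question, and the cache point. Because the messages preceding the cache point aren't currently in the cache, they're written to it. According to the usage counters, 29,841 tokens were written to the cache.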

"cacheWriteInputTokenCount": 29841

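For the next invocations, the previous response and the new question are appended to the list of messages. The messages before the cachePoint are unchanged and are found in the cache.

As expected, the usage counters show that the same number of tokens previously written is now read from the cache.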

"cacheReadInputTokenCount": 29841

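In my tests, the next invocations took 55 percent less time to complete compared to the first one. Depending on your use case (for example, with more cached content), prompt caching can improve latency by up to 85 percent.

Depending on the model, you can set more than one cache point in a list of messages. To find the right cache points for your use case, try different configurations and look at the effect on the reported usage.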

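Things to know
Amazon Bedrock Intelligent Prompt Routing is available in preview today in the US East (N. Virginia) and US West (Oregon) AWS Regions. During the preview, you can use the default prompt routers, and there is no additional cost for using a prompt router. You pay only for the cost of the selected model. You can use prompt routers with other Amazon Bedrock capabilities, such as performing evaluations, using knowledge bases, and configuring agents.

Because the internal model used by the prompt routers needs to understand the complexity of a prompt, Intelligent Prompt Routing currently supports only English-language prompts.

Amazon Bedrock support for prompt caching is available in preview in US West (Oregon) for Anthropic's Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Prompt caching is also available in US East (N. Virginia) for Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro.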

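With prompt caching, cache reads receive a 90 percent discount compared to noncached input tokens. There are no additional infrastructure charges for cache storage. When using Anthropic models, you pay an additional cost for tokens written to the cache. There are no additional costs for cache writes with Amazon Nova models. For more information, see the Amazon Bedrock pricing page.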
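The effect of the cache-read discount on input cost can be estimated with a few lines of arithmetic. This is a rough sketch with assumed values: `input_cost` is a hypothetical helper, the price of 0.003 per 1,000 input tokens is a placeholder rather than a published rate, and `cache_write_multiplier` stands in for whatever cache-write pricing applies to your model (1.0 means no surcharge).

```python
def input_cost(usage, price_per_1k_input,
               cache_read_discount=0.90, cache_write_multiplier=1.0):
    """Estimate the input-side cost of one invocation from its usage counters."""
    per_token = price_per_1k_input / 1000
    fresh = usage["inputTokens"] * per_token
    # Cache reads are billed at a discount relative to noncached input tokens.
    read = (usage.get("cacheReadInputTokenCount", 0)
            * per_token * (1 - cache_read_discount))
    # Cache writes may carry a model-specific surcharge (assumed parameter).
    write = (usage.get("cacheWriteInputTokenCount", 0)
             * per_token * cache_write_multiplier)
    return fresh + read + write


PRICE = 0.003  # placeholder price per 1,000 input tokens, not a real rate

# Second invocation from the example: most input tokens come from the cache.
cached_usage = {"inputTokens": 59, "cacheReadInputTokenCount": 29841}
# The same tokens if nothing had been cached.
uncached_usage = {"inputTokens": 59 + 29841}

cached = input_cost(cached_usage, PRICE)
uncached = input_cost(uncached_usage, PRICE)
print(f"cached/uncached input cost ratio: {cached / uncached:.3f}")
```

Output tokens are billed the same either way, so they are left out of this comparison.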

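With prompt caching, content is cached for up to 5 minutes, and each cache hit resets this countdown. Prompt caching has been implemented to transparently support cross-Region inference. In this way, your applications can get the cost optimization and latency benefits of prompt caching together with the flexibility of cross-Region inference.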
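The sliding 5-minute window can be modeled on the client side to reason about request pacing. The class below is a toy model, not how Amazon Bedrock implements its cache: `SlidingTtl` is a hypothetical name, and treating an access at exactly the expiry time as a miss is an assumption.

```python
class SlidingTtl:
    """Toy model of a cache entry whose countdown restarts on every access."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.expires_at = None  # None means nothing has been cached yet

    def access(self, now):
        """Record an access at time `now` (seconds); return True on a hit."""
        hit = self.expires_at is not None and now < self.expires_at
        # A hit extends the entry; a miss rewrites it. Both restart the countdown.
        self.expires_at = now + self.ttl
        return hit


entry = SlidingTtl()
# Requests at 0 s, 200 s, 450 s, and 800 s.
print([entry.access(t) for t in (0, 200, 450, 800)])  # prints [False, True, True, False]
```

In this model, the third request still hits because the second one reset the countdown, while the fourth arrives more than 5 minutes after the third and misses.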

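These new capabilities make it easier to build cost-effective, high-performing generative AI applications. By intelligently routing requests and caching frequently used content, you can significantly reduce costs while maintaining or even improving application performance.

To learn more and start using these new capabilities today, visit the Amazon Bedrock documentation and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how builder communities are using Amazon Bedrock at community.aws.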

– Danilo
