Getting insights from Amazon Managed Service for Prometheus using natural language powered by Amazon Bedrock

As applications scale, customers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging, and resolving operational issues. Organizations allocate money and developer time to deploy and manage various monitoring tools, while also dedicating considerable effort to training teams on their usage. When issues arise, operators navigate through numerous data sources like dashboards, documentation, runbooks, alerts, logs and more. This prolonged process of identifying root causes delays troubleshooting and remediation efforts, impacting application reliability and customer experience.

Generative AI can help address these challenges by processing and analyzing large volumes of data from a diverse set of monitoring tools, generating insights, and automating responses. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock allows customers to experiment with and evaluate top foundation models, customize them with their data through fine-tuning and Retrieval Augmented Generation (RAG), and create agents that perform tasks using enterprise systems and data sources.

Customers use Amazon Managed Service for Prometheus to securely and durably store the application and infrastructure metrics ingested from cloud, on-premises, and hybrid environments. To derive insights from these metrics, customers rely on writing PromQL queries or on Grafana dashboards. PromQL lets you run complex queries on time-series data and provides insight into application health by filtering, aggregating, and manipulating metrics data in various ways. While this is powerful, it can be intimidating for beginners because of its complex syntax and the need to understand the Prometheus data model.
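To make the filtering and aggregation concrete, here is a minimal Python sketch that assembles a PromQL expression from its parts. The helper `build_promql` and its parameters are hypothetical and exist only for illustration; the metric `container_cpu_usage_seconds_total` is the usual cAdvisor CPU counter and is assumed to be present in your workspace.

```python
def build_promql(metric, labels=None, window=None, agg=None, by=None):
    """Assemble a PromQL expression (hypothetical helper, for illustration only)."""
    selector = metric
    if labels:
        # Label matchers filter the series, e.g. {namespace="kube-system"}
        matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
        selector += "{" + matchers + "}"
    # A range window turns a counter into a per-second rate
    expr = f"rate({selector}[{window}])" if window else selector
    if agg:
        # An aggregation operator collapses series, optionally grouped by labels
        grouping = f" by ({', '.join(by)})" if by else ""
        expr = f"{agg}{grouping} ({expr})"
    return expr


# CPU usage rate per pod in the kube-system namespace over the last 5 minutes
cpu_by_pod = build_promql(
    "container_cpu_usage_seconds_total",
    labels={"namespace": "kube-system"},
    window="5m",
    agg="sum",
    by=["pod"],
)
```

Even this small expression combines a selector, a range vector, a function and an aggregation, which is the kind of syntax the agent in this post generates for you.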

In this blog post, we will review how Amazon Bedrock can be used to get answers about the metrics stored in Amazon Managed Service for Prometheus without having to know PromQL. Using the example shared in the post, customers can generate PromQL queries from natural language descriptions of what they want to monitor or analyze. Organizations can also analyze existing queries and get suggestions to optimize and improve them.

Solution Overview

The following diagram illustrates how an Amazon Bedrock agent derives insights from Amazon Managed Service for Prometheus.

Architecture diagram

At a high level, the steps can be summarized as below:

The AWS managed collector scrapes metrics from the workloads running on the Amazon EKS cluster and ingests them into Amazon Managed Service for Prometheus.
The user uses the Amazon Bedrock agent’s interface to ask questions about the application’s health, such as CPU utilization or memory utilization.
The Amazon Bedrock agent generates the necessary PromQL query based on the user’s request and sends it to the action group.
An action group defines actions that the agent can help the user perform. In this post, you will use a Lambda function that takes the PromQL query shared by the agent, authenticates with Amazon Managed Service for Prometheus, and runs the query.
The action group shares the results with the agent, and the results are further enriched using the knowledge base.
Knowledge bases for Amazon Bedrock allow you to integrate proprietary information into your generative AI applications. Using the Retrieval Augmented Generation (RAG) technique, a knowledge base searches your data to find the most useful information and then uses it to answer natural language questions. The agent then digests the results, adds context, and presents them back to the user in natural language.
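The "authenticate and run the query" step in the flow above can be sketched in Python as follows. This is not the Lambda code from the sample repository; it is a minimal illustration that assumes `boto3`/`botocore` are available (as they are in the Lambda Python runtime) and that the caller's credentials are allowed to query the workspace. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def query_url(region, workspace_id):
    """Instant-query endpoint of an Amazon Managed Service for Prometheus workspace."""
    return (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )


def signed_instant_query(region, workspace_id, promql):
    """Sign a PromQL instant query with SigV4 (service name 'aps') and send it."""
    # Imported lazily so the URL helper above works without the AWS SDK installed
    import boto3
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    body = urllib.parse.urlencode({"query": promql}).encode("utf-8")
    request = AWSRequest(
        method="POST",
        url=query_url(region, workspace_id),
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    # Adds the Authorization and X-Amz-* headers to the request in place
    credentials = boto3.Session().get_credentials()
    SigV4Auth(credentials, "aps", region).add_auth(request)

    http_request = urllib.request.Request(
        request.url,
        data=body,
        headers=dict(request.headers.items()),
        method="POST",
    )
    with urllib.request.urlopen(http_request) as response:
        return json.load(response)
```

Calling `signed_instant_query(region, workspace_id, "up")` with your own Region and workspace ID would return the same JSON document that the awscurl check later in this post prints.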

Prerequisites

For this walkthrough, you need the following:

* AWS Command Line Interface (AWS CLI) version 2
* Amazon EKS cluster
* Amazon Managed Service for Prometheus workspace
* Amazon Managed Grafana workspace
* Access to Claude 3 Sonnet Model in Amazon Bedrock
* awscurl
* Amazon S3 bucket

Note: Amazon Managed Grafana is deployed as part of this walkthrough, but it is optional and is not used in the remaining steps.

Solution Walk-through

Step 1: Setting up the monitoring for Amazon EKS cluster using AWS Managed Collector & Amazon Managed Service for Prometheus

To get started, you will first set up monitoring for the Amazon EKS cluster. You will use the Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana project, which sets up the Amazon EKS cluster and an AWS managed collector. The collector scrapes the metrics and ingests them into a pre-configured Amazon Managed Service for Prometheus workspace. The collected metrics provide insights into the health and performance of the Kubernetes control plane and data plane. You will be able to understand your Amazon EKS cluster from the node level, to pods, down to the Kubernetes level, including detailed monitoring of resource usage.

Let’s start by setting a few environment variables:

export AMG_WORKSPACE_ID=<Your grafana workspace id usually starts with g->

export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $AMG_WORKSPACE_ID \
  --query key \
  --output text)

After creating the API key, you must make it available to the AWS CDK by storing it in AWS Systems Manager Parameter Store with the following command. The command reads the key from $AMG_API_KEY, which you set above. Set $AWS_REGION to the Region that your solution will run in before running it.

aws ssm put-parameter --name "/observability-aws-solution-eks-infra/grafana-api-key" \
  --type "SecureString" \
  --value $AMG_API_KEY \
  --region $AWS_REGION \
  --overwrite

Next, you will be deploying the observability stack using AWS CDK.

git clone https://github.com/aws-observability/observability-best-practices.git
cd observability-best-practices/solutions/oss/eks-infra/v3.0.0/iac/

export AWS_REGION=<Your region>
export AMG_ENDPOINT=<AMG_ENDPOINT>
export EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>
export AMP_WS_ARN=<ARN of Amazon Prometheus workspace>
make deps
make build && make pattern aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME deploy

This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those metrics are stored in Amazon Managed Service for Prometheus, and then displayed in Amazon Managed Grafana dashboards.

To verify that the stack deployed successfully, use awscurl to query the Amazon Managed Service for Prometheus workspace and confirm that metrics are being ingested:

export AMP_QUERY_ENDPOINT=<AMP Query Endpoint>
awscurl -X POST --region <Your region> \
  --service aps "${AMP_QUERY_ENDPOINT}" \
  -d 'query=up' \
  --header 'Content-Type: application/x-www-form-urlencoded'

You should see a result such as:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "instance": "localhost:9090",
          "job": "prometheus",
          "monitor": "monitor"
        },
        "value": [
          1652452637.636,
          "1"
        ]
      }
    ]
  }
}
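If you want to check such a response programmatically rather than by eye, a small Python sketch like the following can flatten an instant-query result into rows. The function name `summarize_vector` is hypothetical; the field names (`status`, `data`, `resultType`, `result`, `metric`, `value`) follow the Prometheus HTTP API response shown above.

```python
import json


def summarize_vector(response_text):
    """Turn an instant-query JSON response into (metric name, labels, value) rows."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")

    rows = []
    for sample in data["result"]:
        labels = dict(sample["metric"])
        name = labels.pop("__name__", "")
        # Each sample is [unix timestamp, value as a string]
        _timestamp, value = sample["value"]
        rows.append((name, labels, float(value)))
    return rows


sample_response = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "up", "job": "prometheus"},
                      "value": [1652452637.636, "1"]}]}}
"""
rows = summarize_vector(sample_response)
```

A value of 1 for the `up` metric means the scrape target was reachable at that timestamp.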

Step 2: Configure the Lambda function as Action Groups for the Amazon Bedrock agent

Next, you will create a Lambda function to serve as an action group for the Amazon Bedrock agent. Amazon Bedrock will translate the user’s natural language question into a PromQL query and pass it to the Lambda function, which runs the query against Amazon Managed Service for Prometheus and returns the result.

git clone https://github.com/aws-samples/Amazon-prometheus-bedrock-agent-example.git

export AMP_WORKSPACE_ID=<AMP WORKSPACE ID>
export AMP_REGION=<Your region>
export ACCOUNT_ID=`aws sts get-caller-identity --query "Account" --output text`


cd Amazon-prometheus-bedrock-agent-example/
python3 ./stage.py --amp-workspace-id ${AMP_WORKSPACE_ID} --amp-region ${AMP_REGION}

The script above creates a Lambda function whose name is prefixed with amp-bedrock-agent, along with the policies the function needs to run.
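For orientation, the core of such an action group function can be sketched in Python as below. This is an illustration, not the code that the staging script deploys. The input fields mirror the event that this post shows later in the Lambda logs; the response shape reflects our understanding of the function-details contract for Amazon Bedrock agents and should be verified against the current documentation. `run_query` is a hypothetical callable that executes the PromQL query and returns a JSON-serializable result.

```python
import json


def extract_parameters(event):
    """Flatten the agent's parameter list into a name -> value dict."""
    return {param["name"]: param["value"] for param in event.get("parameters", [])}


def build_agent_response(event, body_text):
    """Wrap text in the response envelope the agent expects (assumed shape)."""
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body_text}}
            },
        },
        "sessionAttributes": event.get("sessionAttributes", {}),
        "promptSessionAttributes": event.get("promptSessionAttributes", {}),
    }


def handle(event, run_query):
    """Run the PromQL query supplied by the agent and return the result as text."""
    promql = extract_parameters(event).get("query")
    if not promql:
        return build_agent_response(event, "No PromQL query was provided.")
    return build_agent_response(event, json.dumps(run_query(promql)))
```

Passing `run_query` in as an argument keeps the event handling separate from the signed HTTP call, so the handler can be exercised locally with a stub.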

Step 3: Configure the Bedrock Agent

Next, you will create an Amazon Bedrock agent that serves as the interface for natural language questions and invokes the Lambda function, through the action group, to query Amazon Managed Service for Prometheus.

To create an agent with Amazon Bedrock, you set up the following components:

* The configuration of the agent, which defines the purpose of the agent and indicates the foundation model (FM) that it uses to generate prompts and responses.
* Action groups that define what actions the agent is designed to perform or a knowledge base of data sources to augment the generative capabilities of the agent by allowing search and query.

Let’s navigate to the Amazon Bedrock console and create an agent:

Creating Bedrock Agent

We will use Anthropic’s Claude 3 Sonnet for the agent and will create an action group. Make sure you have access to this model, as explained in the prerequisites section.

Agent Builder

Next, let’s assign the action group to the agent. Select the Lambda function created in the previous step, whose name starts with amp-bedrock-agent.

Creating Action groups

Next, you will create the action group function that specifies the business logic for the action group and the parameters to use. For querying Amazon Managed Service for Prometheus, you need to define the right API, the PromQL query, and the time range.
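To show how these three inputs fit together, here is a minimal Python sketch that maps them onto the two query endpoints of the Prometheus HTTP API. The helper `build_request` and its argument names are hypothetical and are not necessarily the parameter names used in the sample's action group; the endpoint paths and form fields (`query`, `time`, `start`, `end`, `step`) are those of the Prometheus HTTP API.

```python
import urllib.parse


def build_request(api, promql, start=None, end=None, step="60s"):
    """Return (path, form-encoded body) for an instant or range query."""
    if api == "query":
        # Instant query: evaluate at a single point in time (defaults to now)
        params = {"query": promql}
        if end is not None:
            params["time"] = end
        return "/api/v1/query", urllib.parse.urlencode(params)
    if api == "query_range":
        # Range query: evaluate at every step between start and end
        if start is None or end is None:
            raise ValueError("query_range requires both start and end")
        params = {"query": promql, "start": start, "end": end, "step": step}
        return "/api/v1/query_range", urllib.parse.urlencode(params)
    raise ValueError(f"unsupported API: {api}")


# Ten minutes of API server request rate, one data point per minute
path, body = build_request(
    "query_range",
    "sum(rate(apiserver_request_total[5m]))",
    start=1700000000,
    end=1700000600,
)
```

Timestamps here are Unix seconds; the Prometheus API also accepts RFC 3339 strings.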

Action group function

You can enrich the agent’s responses by providing additional context through a knowledge base. In this step, you will add a knowledge base whose source document is stored in an S3 bucket.

export S3_Bucket=<Your S3 bucket>
cd Amazon-prometheus-bedrock-agent-example
aws s3 cp promql-query-eks-essential-metrics.docx s3://$S3_Bucket/

First you will create a knowledge base in the Amazon Bedrock console.

Bedrock Knowledge base creation

In the next step, you will use the S3 bucket as a data source of the knowledge base.

Configuring Datasource

Next, you will associate the knowledge base with the Amazon Bedrock agent. Select Agents from the left navigation pane, choose Edit in Agent builder, and add the knowledge base as shown below:

Associating knowledgebase

Step 4: Deriving insights using Amazon Bedrock agent

Now that the setup is complete, you can use the Bedrock agent’s interface to query Amazon Managed Service for Prometheus and get insights without having to build a PromQL query.

First, let’s ask the Bedrock agent for an overview of the Amazon EKS cluster:

Cluster-Overview

Now, let’s get the resource utilization of the kube-system namespace and identify which application is consuming the most CPU:

CPU utilization

Let’s check the API server’s performance:

API server performance

AWS Lambda captures logs for all requests handled by your function and sends them to Amazon CloudWatch Logs. You can explore these logs to see how Amazon Bedrock transforms natural language questions into PromQL queries.

start _amp_query_params: {'function': 'query', 'parameters': [{'name': 'query', 'type': 'string', 'value': 'sum(increase(apiserver_request_total[10m]))'}], 'sessionId': '209466391465695', 'agent': {'name': 'operations-assistant-v1', 'version': 'DRAFT', 'id': 'VFACM92VM1', 'alias': 'TSTALIASID'}, 'sessionAttributes': {}, 'promptSessionAttributes': {}, 'inputText': 'Whats the total request made to the API server in the last 10 minutes?', 'actionGroup': 'query', 'messageVersion': '1.0'}
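If you export such log lines, a short Python sketch like the one below can pair each question with the PromQL that was generated for it. It assumes the log line has the form shown above, a text prefix followed by the Python representation of the event dictionary; the function name `question_and_query` is hypothetical.

```python
import ast


def question_and_query(log_line):
    """Extract (user question, generated PromQL) from one Lambda log line."""
    # Everything after the first ": " is the printed event dictionary
    _prefix, _separator, payload = log_line.partition(": ")
    event = ast.literal_eval(payload)
    parameters = {p["name"]: p["value"] for p in event.get("parameters", [])}
    return event.get("inputText"), parameters.get("query")


line = (
    "start _amp_query_params: {'function': 'query', 'parameters': "
    "[{'name': 'query', 'type': 'string', "
    "'value': 'sum(increase(apiserver_request_total[10m]))'}], "
    "'inputText': 'Whats the total request made to the API server "
    "in the last 10 minutes?', 'actionGroup': 'query', 'messageVersion': '1.0'}"
)
question, promql = question_and_query(line)
```

`ast.literal_eval` is used instead of `json.loads` because the logged dictionary uses single quotes and is therefore not valid JSON.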

Let’s try to do some troubleshooting now. Let’s use the agent to identify if there were any recent container restarts:

Container Restarts

Cleaning up

To delete the resources provisioned in this post, please run the following commands.

cd observability-best-practices/solutions/oss/eks-infra/v3.0.0/iac/
make pattern aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME destroy
# Delete the Lambda function (use the name of the function created in Step 2 if it differs)
aws lambda delete-function \
    --function-name bedrock-amp-query-agent-function

# Delete the Amazon Bedrock agent
aws bedrock-agent delete-agent --agent-id <agent-id>

Conclusion

In this post, we demonstrated how you can use generative AI services such as Amazon Bedrock to derive insights, in natural language, from telemetry signals stored in Amazon Managed Service for Prometheus. This lets you understand application and infrastructure health without having to write complex PromQL queries.

For more information, visit the following resources:

Hands-on Observability workshop
AWS Observability Best Practices
AWS Observability Accelerator

AWS Cloud Operations Blog