Building a voice interface for generative AI assistants

Generative AI is revolutionizing how businesses interact with their customers through natural conversational interfaces. While organizations can implement AI assistants across various channels, phone calls remain a preferred method for many customers seeking support or information.

We’ll demonstrate how to create a voice interface for your existing Amazon Bedrock generative AI assistant, enabling customers to engage in phone-based conversations with your AI implementation.

Solution overview

Using Workflow Studio for Amazon Web Services (AWS) Step Functions, we built a voice communication interface that connects to the Amazon Nova Micro model in Amazon Bedrock (Figure 1). The demo application uses the base model so that callers can ask open-ended questions. To address specific business requirements, organizations can use Amazon Bedrock Agents or Amazon Bedrock Flows instead.

A Step Functions workflow diagram illustrating a voice communication system integrated with Amazon Bedrock. The workflow shows a sequential process starting with call handling, followed by parallel branches: one for managing hold music and another for processing voice input through Amazon Transcribe and Amazon Nova Micro model. The diagram demonstrates the complete call flow from initial welcome message through question-answer cycles to call completion.

Figure 1 – Step Functions workflow that enables voice communication to a generative AI assistant
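As a rough illustration of the workflow shape in Figure 1, the following sketch expresses a comparable state machine in Amazon States Language (ASL) as a TypeScript object. It is not the sample's actual definition: every state name, the `lambdaArn` helper, the account ID, the function names, and the `$.hasMoreQuestions` variable are hypothetical placeholders chosen for readability.

```typescript
// Minimal ASL types, just enough for this sketch (not a complete ASL model).
type State =
  | { Type: "Task"; Resource: string; Next?: string; End?: boolean }
  | { Type: "Parallel"; Branches: Branch[]; Next?: string; End?: boolean }
  | {
      Type: "Choice";
      Choices: { Variable: string; BooleanEquals: boolean; Next: string }[];
      Default: string;
    };

interface Branch {
  StartAt: string;
  States: Record<string, State>;
}

interface StateMachine extends Branch {
  Comment?: string;
}

// Hypothetical placeholder ARN builder; the deployed stack uses its own ARNs.
const lambdaArn = (name: string): string =>
  `arn:aws:lambda:us-east-1:111122223333:function:${name}`;

// Helper: a Task state that either moves to `next` or ends its branch.
const task = (fn: string, next?: string): State =>
  next
    ? { Type: "Task", Resource: lambdaArn(fn), Next: next }
    : { Type: "Task", Resource: lambdaArn(fn), End: true };

const workflow: StateMachine = {
  Comment: "Sketch of a voice question-and-answer loop (assumed names)",
  StartAt: "PlayWelcome",
  States: {
    PlayWelcome: task("speak-welcome", "AskQuestion"),
    AskQuestion: task("speak-prompt", "RecordQuestion"),
    RecordQuestion: task("record-audio", "ProcessQuestion"),
    // Both branches start together; the state completes when both finish.
    ProcessQuestion: {
      Type: "Parallel",
      Branches: [
        {
          StartAt: "PlayHoldMusic",
          States: { PlayHoldMusic: task("play-hold-music") },
        },
        {
          StartAt: "TranscribeRecording",
          States: {
            TranscribeRecording: task("transcribe", "AskModel"),
            AskModel: task("ask-model", "StopHoldMusic"),
            StopHoldMusic: task("stop-hold-music"),
          },
        },
      ],
      Next: "SpeakAnswer",
    },
    SpeakAnswer: task("speak-answer", "AskForMore"),
    AskForMore: task("speak-more-prompt", "MoreQuestions"),
    // Loop back to recording, or end the call.
    MoreQuestions: {
      Type: "Choice",
      Choices: [
        { Variable: "$.hasMoreQuestions", BooleanEquals: true, Next: "RecordQuestion" },
      ],
      Default: "HangUp",
    },
    HangUp: task("hang-up"),
  },
};

console.log(JSON.stringify(workflow, null, 2));
```

The `Parallel` state is what lets the caller hear music while transcription and model inference run. The workflow only moves on to speaking the answer once both branches have finished.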

How it works:

1. Inbound call arrives
2. System plays welcome message
3. System asks the caller for a question
4. Voice recording starts, stopping when silence is detected
5. Parallel flows begin:
   - First flow
     1. Plays music while the caller is on hold
   - Second flow
     1. Transcribes the recording using Amazon Transcribe
     2. Sends the transcribed question to the Amazon Nova Micro model in Amazon Bedrock
     3. Upon receiving the response, stops the on-hold music
6. Text-to-speech plays the model’s answer
7. System asks for additional questions and loops to Step 4 or ends the call
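The second parallel flow sends the transcribed question to Amazon Nova Micro. The sketch below shows what such a request and response could look like, assuming the Amazon Bedrock Converse API message shape. The sample itself may call the model differently (for example, through a Step Functions service integration). The system prompt, the `maxTokens` and `temperature` values, and the helper names are illustrative assumptions, and some Regions may require an inference profile ID instead of the base model ID used here.

```typescript
// Request and response shapes, reduced to the text-only fields this sketch needs.
interface TextBlock {
  text: string;
}

interface ConverseRequest {
  modelId: string;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: TextBlock[] }[];
  inferenceConfig: { maxTokens: number; temperature: number };
}

interface ConverseResponse {
  output: {
    message: { role: "assistant"; content: { text?: string }[] };
  };
}

// Base model ID for Amazon Nova Micro (assumption: no inference profile needed).
const MODEL_ID = "amazon.nova-micro-v1:0";

// Build a request from the caller's transcribed question.
function buildConverseRequest(transcript: string): ConverseRequest {
  return {
    modelId: MODEL_ID,
    // Hypothetical prompt: short answers suit text-to-speech playback.
    system: [{ text: "You are a phone assistant. Answer in two or three short sentences." }],
    messages: [{ role: "user", content: [{ text: transcript.trim() }] }],
    inferenceConfig: { maxTokens: 300, temperature: 0.3 },
  };
}

// Join the text blocks of the model's reply into one string for text-to-speech.
function extractAnswer(response: ConverseResponse): string {
  return response.output.message.content
    .map((block) => block.text ?? "")
    .join(" ")
    .trim();
}

const request = buildConverseRequest("  What are your opening hours?  ");
console.log(JSON.stringify(request, null, 2));
```

With the AWS SDK for JavaScript v3, an object of this shape would typically be passed to `ConverseCommand` from `@aws-sdk/client-bedrock-runtime`. The SDK call is left out so that the sketch runs without credentials.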

Expanded capabilities and optimizations

The following improvements, additional functionality, and advanced features can enhance the demo application:

The transcription component is interchangeable with any speech-to-text generative AI model (including Whisper Large V3 Turbo on Amazon Bedrock Marketplace)
The PSTN audio service RecordAudio Action can be tuned to adjust silence duration and background noise levels
Enabling the PSTN audio service VoiceFocus feature improves call clarity by reducing background noise and enhancing voice quality
PSTN audio service Session Initiation Protocol (SIP) media applications can also handle calls through SIP trunking by using Amazon Chime SDK Voice Connector, streamlining integration with existing phone systems
The UpdateSipMediaApplicationCall API is a PSTN audio service feature that lets you regain call control and apply new actions during active calls
Parallel workflow states allow user-friendly handling of API service calls by playing music during processing
PSTN audio service provides pay-per-minute rates with serverless, scalable telephony infrastructure
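To make the RecordAudio tuning and VoiceFocus items above more concrete, here is a minimal sketch that builds the corresponding PSTN audio service actions as plain JSON objects. The helper names, the default values, the S3 prefix layout, and the 1 to 1000 clamping range are assumptions for illustration. Check the current PSTN audio service documentation for the allowed values before relying on them.

```typescript
interface RecordAudioAction {
  Type: "RecordAudio";
  Parameters: {
    CallId: string;
    DurationInSeconds: number;
    SilenceDurationInSeconds: number;
    SilenceThreshold: number;
    RecordingTerminators: string[];
    RecordingDestination: { Type: "S3"; BucketName: string; Prefix: string };
  };
}

interface VoiceFocusAction {
  Type: "VoiceFocus";
  Parameters: { Enable: boolean; CallId: string };
}

// Hypothetical tuning knobs exposed by this sketch.
interface RecordingTuning {
  maxDurationSeconds?: number;
  silenceDurationSeconds?: number; // how long silence must last to stop recording
  silenceThreshold?: number; // noise level that still counts as silence
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

function buildRecordAudioAction(
  callId: string,
  bucketName: string,
  tuning: RecordingTuning = {},
): RecordAudioAction {
  return {
    Type: "RecordAudio",
    Parameters: {
      CallId: callId,
      DurationInSeconds: tuning.maxDurationSeconds ?? 30,
      // Assumed valid range of 1 to 1000 for both silence settings.
      SilenceDurationInSeconds: clamp(tuning.silenceDurationSeconds ?? 3, 1, 1000),
      SilenceThreshold: clamp(tuning.silenceThreshold ?? 200, 1, 1000),
      RecordingTerminators: ["#"],
      RecordingDestination: { Type: "S3", BucketName: bucketName, Prefix: `${callId}/` },
    },
  };
}

function buildVoiceFocusAction(callId: string, enable: boolean): VoiceFocusAction {
  return { Type: "VoiceFocus", Parameters: { Enable: enable, CallId: callId } };
}

// Example: a noisy call center line might use a higher threshold and a longer pause.
const noisyLine = buildRecordAudioAction("call-123", "example-recordings-bucket", {
  silenceDurationSeconds: 4,
  silenceThreshold: 400,
});
console.log(JSON.stringify([buildVoiceFocusAction("call-123", true), noisyLine], null, 2));
```

Keeping the tuning values in one small options object makes it easier to experiment with different silence settings per deployment without touching the rest of the call flow.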

Deploying the solution

The following steps deploy the voice communication interface workflow (Figure 1) together with the supporting serverless architecture for Step Functions and PSTN audio service integration. In a previous blog post, we demonstrated how combining Step Functions with the Amazon Chime SDK PSTN audio service streamlines the development of reliable telephony applications through visual workflow design.

Prerequisites:

AWS Management Console access
Node.js and npm installed
AWS Command Line Interface (AWS CLI) installed and configured
Enable access to the Amazon Nova Micro model through the Amazon Bedrock console

Walkthrough:

The AWS Cloud Development Kit (AWS CDK) project in the aws-samples GitHub repository deploys the following resources:

phoneNumberBedrock – Provisioned phone number for the demo application
sipMediaApp – SIP media application that routes calls to lambdaProcessPSTNAudioServiceCalls
sipRule – SIP rule that directs calls from phoneNumberBedrock to sipMediaApp
lambdaProcessPSTNAudioServiceCalls – AWS Lambda function for call processing
roleLambdaProcessPSTNAudioServiceCalls – AWS Identity and Access Management (IAM) Role for lambdaProcessPSTNAudioServiceCalls
stepfunctionBedrockWorkflow – Step Functions workflow for the telephony application
roleStepfuntionBedrockWorkflow – IAM Role for stepfunctionBedrockWorkflow
s3BucketApp – Amazon Simple Storage Service (Amazon S3) bucket for storing recordings of customer questions
s3BucketPolicy – IAM Policy granting PSTN audio service access to s3BucketApp
lambdaAudioTranscription – Lambda function for audio transcription
lambdaLayerForTranscription – Lambda layer required for lambdaAudioTranscription
roleLambdaAudioTranscription – IAM Role for lambdaAudioTranscription
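To illustrate the role of the call-processing Lambda function in this list, the sketch below shows the general request and response contract between a SIP media application and its Lambda function: the service sends an invocation event, and the function answers with a list of actions. This is not the code of lambdaProcessPSTNAudioServiceCalls, which coordinates with the Step Functions workflow. The welcome text, the voice settings, and the function names are assumptions.

```typescript
// Subset of the PSTN audio service invocation event used by this sketch.
interface PstnEvent {
  SchemaVersion: string;
  InvocationEventType: string;
  CallDetails: {
    TransactionId: string;
    Participants: { CallId: string; ParticipantTag?: string }[];
  };
}

interface SpeakAction {
  Type: "Speak";
  Parameters: {
    Text: string;
    CallId: string;
    Engine: "neural" | "standard";
    LanguageCode: string;
    TextType: "text" | "ssml";
    VoiceId: string;
  };
}

interface PstnResponse {
  SchemaVersion: "1.0";
  Actions: SpeakAction[];
}

// Build a text-to-speech action (voice settings are illustrative choices).
function speak(callId: string, text: string): SpeakAction {
  return {
    Type: "Speak",
    Parameters: {
      Text: text,
      CallId: callId,
      Engine: "neural",
      LanguageCode: "en-US",
      TextType: "text",
      VoiceId: "Joanna",
    },
  };
}

// Decide which actions to return for an invocation event.
function buildResponse(event: PstnEvent): PstnResponse {
  const callId = event.CallDetails.Participants[0]?.CallId;
  if (event.InvocationEventType === "NEW_INBOUND_CALL" && callId) {
    return {
      SchemaVersion: "1.0",
      Actions: [speak(callId, "Welcome. Please ask your question after the tone.")],
    };
  }
  // For other events (for example, HANGUP) this sketch returns no actions.
  return { SchemaVersion: "1.0", Actions: [] };
}

// Lambda-style entry point wrapping the pure function above.
const handler = async (event: PstnEvent): Promise<PstnResponse> => buildResponse(event);

const sampleEvent: PstnEvent = {
  SchemaVersion: "1.0",
  InvocationEventType: "NEW_INBOUND_CALL",
  CallDetails: {
    TransactionId: "transaction-1",
    Participants: [{ CallId: "call-1", ParticipantTag: "LEG-A" }],
  },
};
handler(sampleEvent).then((response) => console.log(JSON.stringify(response, null, 2)));
```

Separating `buildResponse` from the handler keeps the action logic easy to test locally with hand-written events before deploying.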

Follow these steps to deploy the CDK stack:

Clone the repository.

git clone https://github.com/aws-samples/sample-chime-sdk-bedrock-voice-interface
cd sample-chime-sdk-bedrock-voice-interface
npm install

Bootstrap the stack.

# Default AWS CLI credentials are used; otherwise, pass the --profile parameter
# Provide the <account-id> and <region> to deploy this stack
cdk bootstrap aws://<account-id>/<region>

Deploy the stack.

# Default AWS CLI credentials are used; otherwise, pass the --profile parameter
# phoneAreaCode: the United States area code used to provision the phone number
# (replace NPA with the three-digit area code)
cdk deploy --context phoneAreaCode=NPA

Call the provisioned phone number to test the sample application.

Cleaning up:

To clean up this demo, execute:

cdk destroy

Conclusion

We demonstrated how organizations can add voice capabilities to their existing generative AI implementations using Amazon Bedrock. The solution enables customers to interact with AI assistants through traditional phone calls, expanding accessibility and user engagement. The demo application showcases an architecture combining AWS Step Functions and Amazon Chime SDK PSTN audio service, delivering natural voice conversations with AI models through quick deployment using visual workflows.

Organizations benefit from cost optimization with pay-per-minute pricing, enterprise-ready telephony integration through PSTN or SIP trunking, and automatic scaling to match customer demand. This foundation enables businesses to build practical AI applications ranging from all-day customer service agents to multi-language support services and knowledge base assistants. By following this solution, you can quickly extend your generative AI investments to voice channels, providing more value to your customers while maintaining operational efficiency.

Contact an AWS Representative to learn how we can help accelerate your business.


AWS Messaging & Targeting Blog