Turning audio data into insights with AssemblyAI and AWS

by AWS Editorial Team | 22 August 2024 | Thought Leadership

Phone calls, virtual meetings, podcasts, webinars, videos: audio data is everywhere, and AI is enabling businesses to extract insights from that data like never before. AssemblyAI develops machine learning (ML) models that deliver accurate speech-to-text for voice data as well as speaker detection, sentiment analysis, chapter detection, Personally Identifiable Information (PII) redaction, and more. Staff software engineer Ben Gotthold explains, “We are laser-focused on developing ML models to understand human speech with superhuman ability. In a nutshell, we have a complete AI system for customers to get the most out of their audio data.”

ML-powered audio innovation

AssemblyAI offers a variety of ML models that support a wide range of use cases. For example, a podcast or video platform can use speech recognition, speaker diarization, and summarization models to make their content more searchable. Content moderation and topic detection models can also be used to categorize and flag sensitive or harmful content. PII redaction, keyword detection, sentiment analysis, and entity detection models can be used for conversational intelligence solutions in contact centers or to analyze sales call data and help managers train new team members faster.

To operate effectively and provide excellent customer service, AssemblyAI needed an architecture that delivered in three key areas:

Scalability: with millions of requests coming in every day, AssemblyAI needed to flexibly scale to meet demand, optimize resource usage, and control costs.
Ease of deployment and iteration: AssemblyAI needed an architecture that made the deployment and continual improvement of ML models as straightforward as possible.
Security and compliance: AssemblyAI wanted to build its architecture with services and technologies designed to safeguard data and address the diverse compliance requirements of a global customer base.

AssemblyAI worked together with Amazon Web Services to build an architecture that delivered on all fronts.

Where audio becomes insight—inside AssemblyAI’s architecture

Customers first upload audio data to the AssemblyAI API, or submit a reference to that data using cloud object storage services like Amazon Simple Storage Service (Amazon S3). Gotthold explains, “After a customer has submitted their data to our API, we download it, transcode it, and it actually gets stored in Amazon S3. From there, we can send it to a variety of different models based on the customer’s use case, which includes things like speaker labeling and sentiment analysis.”

Once a customer request has been validated and the required features have been recorded, the process moves to AssemblyAI’s orchestrator, which Gotthold refers to as the “brain of the operation.” The orchestrator decides which models to call and in what order through an inference pipeline. This pipeline is composed of multiple AWS services, including Amazon Simple Queue Service (Amazon SQS), Amazon Elastic Container Service (Amazon ECS), and Amazon S3.
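The ordering decision described above can be sketched as a dependency-ordering problem. This is a minimal illustration, not AssemblyAI's implementation: the model names and the dependency map are hypothetical, and the only dependency taken from the article is that speaker labeling runs after audio has been converted to text.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each model lists the models that must run first.
# Only "speaker_labels after transcription" comes from the article; the rest
# are assumptions made for illustration.
MODEL_DEPENDENCIES = {
    "transcription": set(),
    "speaker_labels": {"transcription"},
    "sentiment_analysis": {"transcription"},
    "summarization": {"transcription", "speaker_labels"},
}


def plan_pipeline(requested_features):
    """Return the requested models plus their prerequisites, in a valid run order."""
    needed = set()
    stack = list(requested_features)
    while stack:
        model = stack.pop()
        if model not in MODEL_DEPENDENCIES:
            raise ValueError(f"unknown model: {model}")
        if model not in needed:
            needed.add(model)
            # Pull in everything this model depends on.
            stack.extend(MODEL_DEPENDENCIES[model])
    # Sort the keys so the graph is built in a stable order between runs.
    graph = {model: MODEL_DEPENDENCIES[model] for model in sorted(needed)}
    # static_order() yields each model only after all of its prerequisites.
    return list(TopologicalSorter(graph).static_order())
```

A request for speaker labeling alone would therefore be planned as transcription first, then speaker labeling, even though the customer never asked for transcription explicitly.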

The orchestrator sends messages into Amazon SQS, a fully managed message queuing service for microservices, distributed systems, and serverless applications. These messages then drive the ML models running on Amazon ECS, a container orchestration service that enables AssemblyAI to efficiently deploy, manage, and scale its models.

“We have dozens of models deployed. We are iterating on them constantly, deploying new versions, as well as new models,” says Gotthold. Within Amazon ECS, AssemblyAI’s ML models are automatically scaled up and down as needed based on customer demand.

Optimizing resource usage and controlling costs

AssemblyAI also uses Amazon CloudWatch to monitor and respond to performance changes and optimize resource usage. Gotthold explains: “Requests are coming in all the time, millions per day, and we’re recording that in CloudWatch. We know, based on the decision engine inside our orchestrator, which models are needed and in what order they’re going to be called. So, using signals like queue depth and other custom metrics, we can provision just the right amount of model workers. Popular models are going to be scaling up and down more quickly than less popular ones.”

“A great example of that would be a customer requesting speaker labeling (who is talking and when). We know that comes after converting audio into text, so we can pre-scale that service such that the capacity is there right when we need it.” Beyond efficiency, optimizing resource usage also delivers savings. “In general, it’s pretty expensive to run these models on GPUs, so we really like to have good scaling to control costs,” says Gotthold.
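The scaling idea in this section, sizing each model's worker pool from queue depth and pre-scaling a downstream model based on work still waiting upstream, can be sketched as follows. This is a rough sketch under stated assumptions: the function names, the backlog-per-worker target, the worker limits, and the "fraction requesting" estimate are all hypothetical and do not come from AssemblyAI.

```python
import math


def desired_workers(queue_depth, backlog_per_worker, min_workers=1, max_workers=50):
    """Workers needed so that each handles at most `backlog_per_worker` queued messages."""
    if backlog_per_worker <= 0:
        raise ValueError("backlog_per_worker must be positive")
    needed = math.ceil(queue_depth / backlog_per_worker)
    # Clamp to the allowed range so the pool never drops to zero or grows unbounded.
    return max(min_workers, min(max_workers, needed))


def prescaled_workers(own_queue_depth, upstream_queue_depth, fraction_requesting,
                      backlog_per_worker, min_workers=1, max_workers=50):
    """Size a downstream pool by counting upstream work that is expected to arrive.

    `fraction_requesting` is an assumed estimate of how many upstream jobs
    (for example, transcriptions) will also need this model (for example,
    speaker labeling).
    """
    expected_depth = own_queue_depth + upstream_queue_depth * fraction_requesting
    return desired_workers(expected_depth, backlog_per_worker, min_workers, max_workers)
```

With 20 messages already queued for speaker labeling, 100 transcriptions still in flight, and half of them expected to need speaker labels, a target of 10 messages per worker would call for 7 workers rather than the 2 that the model's own queue suggests. In a real deployment the inputs would come from metrics such as those recorded in CloudWatch.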

Once the customer’s request is complete, a notification is sent through Amazon Simple Notification Service (Amazon SNS) to AWS Lambda, a serverless, event-driven compute service, which notifies the customer that their transcription is ready.
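A Lambda function subscribed to an SNS topic receives each notification inside the `Records` list of the incoming event, with the published payload under `Sns.Message`. The handler below is a minimal sketch of that final step. The message fields (`transcript_id`, `status`) are hypothetical, and the actual delivery to the customer is left as a comment because the article does not describe how it is done.

```python
import json


def handler(event, context):
    """Lambda entry point for invocations triggered by an SNS topic."""
    notifications = []
    for record in event.get("Records", []):
        # SNS delivers the published payload as a string; here it is assumed to be JSON.
        message = json.loads(record["Sns"]["Message"])
        notifications.append({
            "transcript_id": message["transcript_id"],  # hypothetical field
            "status": message["status"],                # hypothetical field
        })
        # A real handler would notify the customer here, for example by
        # calling a customer-supplied callback URL (an assumption).
    return {"notified": notifications}
```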

Prioritizing data security and responsible usage

AssemblyAI works with a global customer base and must adhere to strict compliance and data security standards. “There’s a lot of non-functional requirements: compliance, things like that. We’re SOC 2 Type 2 certified and really care about following best practices for storing data,” says Gotthold.

AWS is architected to be the most secure global cloud infrastructure on which to build, migrate, and manage applications and workloads. AWS services like Amazon ECS and Amazon S3 empower users to securely manage data, detect potentially suspicious behavior, and minimize risk. As Gotthold explains, “We have strict lifecycle policies in Amazon S3, so we keep the data only as long as it’s useful for our orchestrator and ML pipeline.”
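As an illustration of what such a lifecycle policy can look like, the sketch below builds an S3 lifecycle configuration that expires objects a fixed number of days after creation. The rule IDs, prefixes, retention periods, and bucket name are hypothetical; the article does not state AssemblyAI's actual retention settings. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, which is shown only in a comment so the example runs without AWS credentials.

```python
def expiration_rule(rule_id, prefix, days):
    """Build one lifecycle rule that deletes objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical prefixes and retention periods, chosen only for illustration.
lifecycle_configuration = {
    "Rules": [
        expiration_rule("expire-uploaded-audio", "uploads/", 3),
        expiration_rule("expire-transcoded-audio", "transcoded/", 1),
    ]
}

# Applying the policy requires boto3 and AWS credentials:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```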

Enabling audio data innovation

AssemblyAI continues to innovate on behalf of its customers and empower them with new ML models. The company released LeMUR in 2023, a framework for applying large language models (LLMs) to speech data. With just a few lines of code, customers can use LeMUR to create custom summaries for multiple audio files at once, ask questions about their data with natural language prompts, recap action items from meeting recordings, and more.

By building its architecture on AWS, AssemblyAI can continue to create innovative solutions like LeMUR and uncover new ways to turn audio data into insights. All the while, the business knows it has the scalability, ease of deployment, and security features to effectively manage demand and deliver an exceptional service to its customers.

Learn more about how AWS gives your software or technology company the freedom to migrate, innovate, and scale. Contact us now to get started.
