AWS Cloud Operations Blog
Event Driven Architecture using Amazon EventBridge – Part 2
This post is co-authored with Andy Suarez and Kevin Breton (from KnowBe4).
This blog post continues the discussion from Event-Driven Architecture using Amazon EventBridge – Part 1. The previous post covered the adoption and design of an event-driven architecture by KnowBe4, a leading security awareness training provider. In this post, we highlight the development and implementation of a new KnowBe4 product feature, the Audit Log service. This new service tracks changes in the KnowBe4 Security Awareness Training (KSAT) products, letting users see what changed in their system, which user made the change, and when it happened. All of this is achieved with an event-driven architecture built on an Amazon EventBridge subscriber and a GraphQL API using AWS AppSync.
As part of the initial event-driven architecture adoption discussions, KnowBe4 determined a list of services that would be ideal first adopters of the Audit Log service. The vision was to have various services publish audit events to an event bus; the Audit Log service would subscribe to these events, persist them to a data store, and provide an API for querying the audit log. Figure 1 shows the application architecture.
Data requirements
The initial concept involved persisting audit events in a data store for 180 days, then potentially moving them to cold storage once that period elapsed. Three KnowBe4 teams wanted to use the Audit Log service, with the following estimated event volumes.
| Team | Events per day | Events over 180 days |
| --- | --- | --- |
| KSAT | 27 million | 5 billion |
| PhishER | 222 million | 40 billion |
| Internal tools | 5.5 thousand | 1 million |
The base event schema has a metadata section that includes the detail-type (the audit event type), the principal (who made the change), the resource (what it happened to), and the timestamp of the event. However, each team has its own requirements for the fields it wants to filter on.
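As a rough illustration, the sketch below publishes a hypothetical audit event whose detail carries a metadata section like the one described above. The bus name, source, detail-type, and field names are assumptions for illustration, not KnowBe4's actual schema.

```python
import json
from datetime import datetime, timezone

import boto3

events = boto3.client("events")

# Hypothetical audit event: the detail-type names the audit event type, and
# the metadata section records who did it, what it happened to, and when.
detail = {
    "metadata": {
        "principal": "user-1234",              # who made the change
        "resource": "training-campaign/5678",  # what it happened to
        "timestamp": datetime.now(timezone.utc).isoformat(),
    },
    "changes": {"status": {"old": "draft", "new": "active"}},
}

events.put_events(
    Entries=[
        {
            "EventBusName": "audit-log-bus",  # assumed bus name
            "Source": "ksat",                 # assumed event source
            "DetailType": "CampaignUpdated",  # audit event type
            "Detail": json.dumps(detail),
        }
    ]
)
```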
Data store selection
KnowBe4 explored several options, spanning different data storage approaches, and built prototypes for two of them.
| Option | Evaluation |
| --- | --- |
| Amazon DynamoDB | For a better user experience, the product team at KnowBe4 wanted to avoid limiting the query access patterns available to their users. Since all potential access patterns could not be anticipated upfront, creating enough secondary indexes to cover them risked increasing database complexity and maintenance. |
| Amazon Redshift Serverless | Amazon Redshift is well suited to complex analytical queries, providing very fast response times when working with large JSON datasets. However, Amazon Aurora Serverless v2 offers a flexible, serverless database architecture that may achieve comparable performance for the team's specific JSON data processing needs. |
| Amazon OpenSearch Service | The key requirements for the data store were storage, querying, and some baseline analytics. OpenSearch offers more features than this workload requires, including natural language processing, and the AWS cost calculator estimated that OpenSearch would cost more than provisioned Amazon Aurora. |
| Amazon Aurora PostgreSQL (provisioned) | Amazon Aurora PostgreSQL is a good fit for this project's requirements. The KnowBe4 team is mature and confident in its usage of Aurora, and this option supports multiple indexes and the ability to add new indexes as new query patterns emerge. |
| Amazon Aurora Serverless v2 | A great fit. In addition to multiple indexes and the ability to add new indexes for new query patterns, Amazon Aurora Serverless v2 also provides dynamic scaling. Most audit events are published during business hours, so the instance can scale down during other hours and potentially save money. It also has low operational overhead. |
Amazon Aurora Serverless v2
Amazon Aurora Serverless v2’s flexibility and scalability made it the best choice for the project. It allowed for cost optimization by scaling based on usage and leveraged the KnowBe4 team’s expertise with Aurora. The backup plan was to move to provisioned Amazon Aurora if the pricing or auto scaling proved unacceptable. The team configured their Aurora Serverless v2 instance to scale from 0.5 to 10 Aurora Capacity Units (ACUs). The system architecture includes a Writer instance and a separate Reader instance. The Event Subscriber component, written in Rust, is responsible for performing bulk insertions of records into the database. To manage the database connections, the Event Subscriber leverages Amazon RDS Proxy.
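A minimal sketch of setting that capacity range with boto3; the cluster identifier is hypothetical.

```python
import boto3

rds = boto3.client("rds")

# Configure the Aurora Serverless v2 capacity range (0.5 to 10 ACUs).
rds.modify_db_cluster(
    DBClusterIdentifier="audit-log-cluster",  # hypothetical identifier
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,
        "MaxCapacity": 10.0,
    },
    ApplyImmediately=True,
)
```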
During the load testing phase, the team found that the Amazon Aurora Serverless v2 instance scaled up effectively to handle the increased workload. However, the scale-down process took several hours to complete, and the instance did not drop below 4 ACUs even after the load had decreased (Figure 2).
Diving deep, the team found Amazon RDS Proxy was keeping connections open to the RDS instance. Adjusting the PostgreSQL idle_session_timeout to 5 minutes and reducing the RDS Proxy idle_client_timeout ensured the RDS Proxy connections would automatically close instead of staying open, which significantly improved scaling down (Figure 3).
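A sketch of those two timeout changes via boto3 follows. The proxy and parameter group names are hypothetical; the RDS Proxy value is in seconds, while the PostgreSQL idle_session_timeout parameter defaults to milliseconds.

```python
import boto3

rds = boto3.client("rds")

# Close RDS Proxy client connections that sit idle for 5 minutes (seconds).
rds.modify_db_proxy(
    DBProxyName="audit-log-proxy",  # hypothetical proxy name
    IdleClientTimeout=300,
)

# Set the PostgreSQL idle_session_timeout to 5 minutes (milliseconds) through
# the database's parameter group.
rds.modify_db_parameter_group(
    DBParameterGroupName="audit-log-pg",  # hypothetical parameter group
    Parameters=[
        {
            "ParameterName": "idle_session_timeout",
            "ParameterValue": "300000",
            "ApplyMethod": "immediate",
        }
    ],
)
```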
With both a Reader and a Writer instance, the database did not scale all the way down to 0.5 ACUs; the lowest observed was about 2 ACUs for each instance.
A quick analysis compared the yearly cost of provisioned db.r7g.large instances against equivalent scaled-down Aurora Serverless v2 instances. It suggested that continuously running two Aurora Serverless v2 instances at 2 ACUs each was more cost-effective than running two provisioned Amazon RDS instances of the db.r7g.large instance type. Furthermore, with variable workloads, the cost of Aurora Serverless would be lower still.
2 instances * 2 ACU * 24 hours * 365 days * $0.12 = $4,204.8 per year minimum
2 db.r7g.large instances * 24 hours * 365 days * $0.276 = $4,835.52 per year
Query performance – partitioning
Query performance is a critical design consideration for KnowBe4. Queries must search through billions of records while remaining responsive, because users of the associated web application expect fast interactions even at that scale. To address this challenge, the team identified database partitioning as a promising approach: by partitioning the tables intelligently, the system can quickly locate and retrieve the relevant data even when a query spans billions of records. This partitioning strategy is a core part of the overall architecture.
KnowBe4 decided to partition by day and to use pg_partman and pg_cron to automate daily partition creation. Each query is required to include one or more account IDs and a date range, and the date range must be bounded so that no query scans the entire table.
The team investigated sub-partitioning based upon an account range. However, the pg_partman documentation recommends against this. “Sub-partitioning with multiple levels is supported, but it is of very limited use in PostgreSQL and provides next to NO PERFORMANCE BENEFIT outside of extremely large data in a single partition set (100s of terabytes, petabytes).”
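As a rough illustration of this setup, the sketch below creates a hypothetical daily-partitioned audit_events table, hands partition management to pg_partman (4.x syntax assumed), applies the 180-day retention, and schedules nightly maintenance with pg_cron. Table, connection, and job names are assumptions; the final query shows the account-plus-date-range shape that lets the planner prune partitions.

```python
import psycopg2

# Hypothetical connection; assumes the pg_partman and pg_cron extensions are
# installed and that pg_partman 4.x syntax applies.
conn = psycopg2.connect("dbname=auditlog")
conn.autocommit = True

with conn.cursor() as cur:
    # Parent table, range-partitioned by the event timestamp.
    cur.execute("""
        CREATE TABLE audit_events (
            account_id  bigint      NOT NULL,
            occurred_at timestamptz NOT NULL,
            detail      jsonb       NOT NULL
        ) PARTITION BY RANGE (occurred_at);
    """)

    # Let pg_partman manage daily child partitions, pre-creating a week ahead.
    cur.execute("""
        SELECT partman.create_parent(
            p_parent_table := 'public.audit_events',
            p_control      := 'occurred_at',
            p_type         := 'native',
            p_interval     := 'daily',
            p_premake      := 7
        );
    """)

    # Drop partitions once they pass the 180-day retention window.
    cur.execute("""
        UPDATE partman.part_config
           SET retention = '180 days', retention_keep_table = false
         WHERE parent_table = 'public.audit_events';
    """)

    # pg_cron runs pg_partman maintenance nightly to create and retire partitions.
    cur.execute("""
        SELECT cron.schedule('partman-maintenance', '0 3 * * *',
                             $$CALL partman.run_maintenance_proc()$$);
    """)

    # Example query shape: account IDs plus a bounded date range lets the
    # planner prune down to only the relevant daily partitions.
    cur.execute("""
        SELECT * FROM audit_events
         WHERE account_id = ANY(%s)
           AND occurred_at >= %s AND occurred_at < %s
         ORDER BY occurred_at DESC
         LIMIT 100;
    """, ([1234], "2024-01-01", "2024-01-08"))
```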
Database schema migrations
KnowBe4 chose Flyway to manage database schema versions. Flyway is an open source database version control tool that facilitates easy database migrations. It automates the creation, change, and rollback of schema changes across various development environments. With Flyway, teams can track changes to the database schema in a simple, versioned manner through SQL migration scripts.
KnowBe4 uses the Flyway CLI in their CI/CD pipelines. This helps to ensure clean deployments while preventing data corruption issues during production rollouts. Overall, Flyway is a simple yet powerful tool that helps developers continuously evolve database schemas with confidence across all stages of the software development lifecycle.
Audit Log service design
The Audit Log service utilizes an event-driven architecture. Various systems generate audit log events, which are sent to Amazon EventBridge. EventBridge then evaluates these events against its rules and forwards the matching events to the corresponding Amazon SQS queues. An Audit Log Lambda function reads the audit log events from these SQS queues and persists the data into an Amazon Aurora database. Figure 4 shows the high-level Audit Log service architecture.
KnowBe4 decided to implement a separate Amazon Aurora PostgreSQL database for each product producing events. The primary goal was to isolate each product so that it could not impact the others. Each Aurora PostgreSQL instance can also be sized to its system’s requirements and customized per product, for example with different indexes.
KnowBe4 built the Audit Log event subscriber with AWS Lambda using the Rust runtime. The subscriber parses the events and performs the bulk inserts. So that the Lambda can process a group of messages from the same source without extensive per-message if/else handling, the team created Amazon EventBridge rules that target different Amazon SQS queues based on the event source.
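A sketch of that routing with boto3; the bus, rule, and queue names are assumptions for illustration.

```python
import json

import boto3

events = boto3.client("events")

# One rule per event source, each targeting that source's own SQS queue, so a
# Lambda invocation only ever receives a batch of messages from one source.
for source, queue_arn in [
    ("ksat", "arn:aws:sqs:us-east-1:111122223333:audit-ksat"),
    ("phisher", "arn:aws:sqs:us-east-1:111122223333:audit-phisher"),
]:
    events.put_rule(
        Name=f"audit-{source}",
        EventBusName="audit-log-bus",
        EventPattern=json.dumps({"source": [source]}),
        State="ENABLED",
    )
    events.put_targets(
        Rule=f"audit-{source}",
        EventBusName="audit-log-bus",
        Targets=[{"Id": "sqs", "Arn": queue_arn}],
    )
```

Each queue also needs a resource policy allowing EventBridge to send messages to it, which is omitted here.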
The Lambda uses the Factory Method design pattern in Rust to instantiate a handler based on the event source. Each handler knows the specific event model its source publishes and can readily process those events for insertion into the database.
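The production subscriber is written in Rust; purely to illustrate the dispatch idea in the same language as the other sketches in this post, a rough Python equivalent of the pattern might look like the following. The handler registry, field names, and return shape are all assumptions.

```python
import json
from abc import ABC, abstractmethod


class AuditHandler(ABC):
    """Knows one source's event model and turns events into rows to insert."""

    @abstractmethod
    def to_rows(self, events: list[dict]) -> list[tuple]: ...


class KsatHandler(AuditHandler):
    def to_rows(self, events):
        return [
            (e["detail"]["metadata"]["principal"],
             e["detail"]["metadata"]["resource"],
             e["detail"]["metadata"]["timestamp"])
            for e in events
        ]


# Factory: map the event source to the handler that understands its model.
HANDLERS: dict[str, type[AuditHandler]] = {"ksat": KsatHandler}


def handler_for(source: str) -> AuditHandler:
    return HANDLERS[source]()


def lambda_handler(event, context):
    # All records in one SQS batch come from the same source (one queue per
    # source), so a single handler can process the entire batch.
    batch = [json.loads(r["body"]) for r in event["Records"]]
    rows = handler_for(batch[0]["source"]).to_rows(batch)
    # ... bulk-insert `rows` into Aurora here ...
    return {"batchItemFailures": []}
```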
The ability to query the data is provided by a GraphQL implementation written in a Python Lambda function. Python was chosen for its vast ecosystem of libraries and frameworks, rapid development, and seamless integration with various AWS services. KnowBe4 also leveraged AWS Lambda Powertools for Python to speed up development with known-good patterns such as tracing, logging, and the AppSync event handler.
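A minimal sketch of the Powertools AppSync event handler; the GraphQL field and argument names are assumptions, not KnowBe4's schema.

```python
from aws_lambda_powertools import Logger, Tracer
from aws_lambda_powertools.event_handler import AppSyncResolver

logger = Logger()
tracer = Tracer()
app = AppSyncResolver()


# Hypothetical GraphQL field: Query.auditEvents(accountId, dateFrom, dateTo).
# Powertools passes the field's GraphQL arguments as keyword arguments.
@app.resolver(type_name="Query", field_name="auditEvents")
def audit_events(accountId: str, dateFrom: str, dateTo: str):
    # ... query Aurora for this account and date range, return matching rows ...
    return []


@logger.inject_lambda_context
@tracer.capture_lambda_handler
def lambda_handler(event, context):
    return app.resolve(event, context)
```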
To provide secure access to the GraphQL queries, the solution leverages AWS WAF (Web Application Firewall) and a custom JWT (JSON Web Token) Authorizer Lambda function. This multi-layered security approach ensures that only authorized users can interact with the GraphQL API. Furthermore, the GraphQL data models are directly integrated with AWS AppSync resolvers, which have a deep understanding of the underlying database model. This tight coupling between the GraphQL schema and the database structure simplifies the implementation and enables efficient data retrieval.
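A sketch of what such a Lambda authorizer might look like. AppSync's Lambda authorization passes the caller's token as `authorizationToken` and expects an `isAuthorized` verdict back; the PyJWT usage, environment variable, and claim name are assumptions for illustration.

```python
import os

import jwt  # PyJWT, assumed here for signature verification

PUBLIC_KEY = os.environ["JWT_PUBLIC_KEY"]  # issuer's verification key


def lambda_handler(event, context):
    # AppSync sends the caller's token in `authorizationToken`.
    token = event.get("authorizationToken", "")
    try:
        claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        return {"isAuthorized": False}

    return {
        "isAuthorized": True,
        # Values made available to resolvers, e.g. to scope queries by account.
        "resolverContext": {"accountId": claims.get("account_id", "")},
    }
```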
Modernization Experience-Based Acceleration (ModAx EBA)
While the development of the Data Foundation and Audit Log service was a substantial internal effort, KnowBe4 partnered with AWS on rapid execution of these initiatives by leveraging the Modernization Experience-Based Acceleration program (ModAx EBA) for sprint planning, workstream alignment, and subject matter expertise. AWS Experience-Based Acceleration (EBA) is the methodology that underpins ModAx, which combines an application assessment, enablement, and execution process into a six to eight-week sprint to develop long-term modernization muscle within teams. KnowBe4 has extensive experience with AWS and advanced use cases, and the ModAx EBA offered an opportunity to collaborate across development teams and leverage AWS Solutions Architect Specialists specific to the technologies KnowBe4 was exploring.
The ModAx EBA initiative included several parallel workstreams. These workstreams encouraged rapid development by bringing together all necessary engineers and stakeholders. Teams were then aligned to the specific goals and success criteria for their workstream’s execution.
- Cloud Security Workstream. This workstream focused on addressing security concerns around data entitlement, ensuring that data sent to consumers is appropriately scoped and restricted based on their permissions. The key goals included developing mechanisms to limit which subscribers can receive certain events. For example, if a user-created event is published and a service called X subscribes to that event, the system should check the customer account’s configurations. If the customer has not enabled service X, then the event should not be sent to that service, even though it has subscribed (a conceptual sketch of this check follows the list below). This would prevent unauthorized access to data that the customer has not granted permissions for.
The cloud security workstream was tasked with designing and implementing these access control and entitlement management capabilities to uphold the principle of least privilege and ensure that data is only shared with the appropriate subscribers based on the customer’s defined policies and entitlements.
- Producers and Consumers Workstream. This workstream was focused on demonstrating various use cases that leverage the event bus capabilities across the KnowBe4 platform. The goals included showcasing how different components within the platform can act as both producers and consumers of events, highlighting the versatility and applicability of the event-driven architecture.
- Event Handlers Workstream. This workstream aimed to provide support for both local and AWS environments, create Infrastructure as Code (IaC) scripts to enable automated and consistent deployments, remove roadblocks or access-related issues on the AWS side, and quickly adapt the implementation based on feedback from the other workstreams. The core focus was to build a robust and flexible event handling system that can operate seamlessly in different deployment environments, while also ensuring the necessary infrastructure and access controls are in place. The team worked iteratively, making adjustments as needed to address requirements and feedback from the other workstreams.
- Audit Log Workstream. This workstream built the Audit Log event subscriber for persisting events to a data store, and designed and implemented the Audit Log frontend page. It also built an integration with the Audit Log backend API services to pull data and display it to end users. This demonstrated the end-to-end cycle of audit events being created and shown in the Audit Log, and provided robust end-user filtering of events.
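As referenced in the Cloud Security workstream above, a conceptual sketch of the entitlement check. The entitlement store, account IDs, and service names are all hypothetical; this only illustrates the decision, not KnowBe4's implementation.

```python
# Conceptual sketch: deliver an event to a subscriber only if the customer's
# account has enabled that service. Entitlement data here is hypothetical.
ENTITLEMENTS = {
    "account-1234": {"service-x"},  # account-1234 has enabled service X
    "account-5678": set(),          # account-5678 has not
}


def should_deliver(event: dict, subscriber: str) -> bool:
    account = event["detail"]["metadata"].get("account_id", "")
    return subscriber in ENTITLEMENTS.get(account, set())


# Even though service X subscribed to user-created events, account-5678's
# events are filtered out because that customer never enabled service X.
event = {"detail": {"metadata": {"account_id": "account-5678"}}}
assert should_deliver(event, "service-x") is False
```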
To ensure the progress of all workstreams, a Command Center was established to guide the EBA’s goals, blockers, and accomplishments. The core functions of the Command Center workstream were to monitor resource and skill needs by workstream and to handle escalations, cross-workstream blockers, and decisions as they arose. The participants met twice a week for five weeks to develop the target architecture and discover all prerequisites needed for successful implementation. A three-day workshop was then held to execute the goals of each cross-functional workstream and deliver on the EBA’s defined success criteria.
ModAx EBA Accomplishments
The ModAx EBA proved that the event-driven architecture was a stable backbone for decoupling of services and delivered the desired design and performance for Audit Log. The execution was delivered to KnowBe4’s standards of engineering best practices, with security top of mind. During the ModAx EBA, the system was able to receive targeted events from the bus that allowed KnowBe4 to decouple the services and migrate the workloads.
The Audit Log system created the frontend/backend services to enable end-users to self-serve audit events. Additionally, KnowBe4 was able to gain deep insights into specific service details such as Lambda, Aurora PostgreSQL and Amazon EventBridge.
Some of the quotes captured during the ModAx EBA:
“The ModAx EBA process brought our teams into contact with AWS engineers in an environment tuned for rapid prototyping, which definitely accelerated getting this project released sooner than planned.”
– KnowBe4 EVP of Engineering
“Having the ability to get down into the details of limits, cost, reliability, and trade-offs live with all the stakeholders proved to be an invaluable experience.”
– KnowBe4 SVP of Engineering
Conclusion
KnowBe4’s adoption of event-driven architecture has proven to be a stable backbone, allowing them to decouple services, reduce complexity, and ship new features quarters ahead of schedule. The execution adhered to KnowBe4’s engineering best practices, with security as a top priority.
Amazon EventBridge’s scalable and serverless event bus enabled KnowBe4 to integrate different applications and services without complex integrations. During the ModAx EBA process, the team received targeted events from Amazon EventBridge, further decoupling services and migrating workloads. The Audit Log system created frontend and backend services for end-users to self-serve audit events. KnowBe4 also gained valuable insights into AWS services like Lambda, RDS, and EventBridge.
The ModAx EBA process was invaluable, allowing close collaboration with AWS engineers in a rapid prototyping environment, accelerating the project’s release. Discussions of limits, costs, reliability, and trade-offs with stakeholders proved to be an equally valuable experience.