AWS Storage Blog

How Vivian Health is using Amazon S3 Express One Zone to accelerate healthcare hiring

Vivian Health connects travel nurses with job opportunities across the country. To do that, the platform has innovated not just the job search itself, but also the tooling used by recruiters and hiring managers to get qualified candidates matched to the right job and placed as quickly and as seamlessly as possible. However, the process of sourcing and posting these jobs is complex, involving multiple systems. This often leads to delays, inaccuracies, and a less-than-optimal experience for the nurses.

To address these issues, Vivian Health has started forming direct partnerships with upstream job systems, enabling them to access and post job openings more quickly and accurately, providing a better experience for candidates seeking employment. These new partnerships bring large influxes of data into Vivian’s systems, requiring a solution that can handle them effectively. Amazon S3 Express One Zone provides the necessary scalability and performance to handle the large influxes of data from the new Shared Job Systems integrations, allowing Vivian Health to rapidly process and make use of this data for innovative features such as Vivian’s AI Copilot.

In this post, we explain how Vivian used S3 Express One Zone to scale its recruitment platform by improving the performance of massive data ingestion from various sources. S3 Express One Zone helped reduce storage latency by 75% and optimize compute usage, enabling Vivian to reduce the TCO of its solution by 71%, and enhance the recruitment process by providing timely and accurate data. We explore why Vivian Health chose this solution and the technological challenges it helped solve, demonstrating how these innovations have transformed the job search landscape for healthcare professionals.

How travel healthcare jobs get posted on Vivian Health today

Vivian Health is a platform that connects travel nurses with job opportunities across the country by aggregating job listings from various partnered staffing agencies. Behind the scenes, healthcare facilities have ongoing relationships with Shared Job Systems allowing them to privately post their temporary job openings (1). Staffing agencies then receive these job postings through their own automated and manual integrations with the Shared Job Systems (2). Finally, Vivian’s platform surfaces tens of thousands of these jobs from hundreds of agencies through automated integrations or through agency recruiters manually creating or updating jobs via the web application (3).

Vivian also introduced a supplemental data flow (4) between Shared Job Systems and Vivian Health. Having direct access to the Shared Job Systems allows Vivian Health to present candidates with all of the agencies offering the same job. This enables candidates to make apples-to-apples comparisons and pick the offer that makes the most sense for them. The data from these systems also creates the opportunity for Vivian Health to provide accurate answers to plain language questions about the role when the agency’s recruiter is unavailable to do so through Vivian’s AI Copilot feature. As such, direct live access to this data is highly valuable and critical to improving travel nurses’ hiring experience.

Job data flow from healthcare facilities to Vivian and directly between Shared Job Systems and Vivian

Initial data ingestion challenges with the Job Data Flow

In the single-instance solution, the introduction of supplemental data brought data ingestion levels that the Job Data Flow architecture needed to handle without bottlenecks. Vivian Health takes pride in nimble engineering and favors tooling and approaches that allow for solving problems with simple and cost-effective architectures that a small number of engineers can manage and maintain. Following that principle, Vivian sought to add a dedicated storage layer to the ingestion solution to decouple the ingestion and consumption processes and to improve overall reliability and scalability. The storage layer also needed to enhance data availability, durability, and redundancy, ensuring the ingested data was readily accessible and secure. Initially, Amazon S3 Standard was chosen as the storage layer due to its industry-leading scalability, data availability, security, and performance.

Initial data ingestion solution for the Job Data Flow, using Amazon S3 Standard for the storage layer

In the data ingestion workflow, a 30-minute burst of data ingestion activity is supposed to occur in Amazon Data Firehose at the start of each hour. For monitoring purposes, this burst acts as a proxy to the pattern of data access on the Amazon S3 bucket where the data was concurrently being loaded. Before assessing how the S3 Standard storage layer would hold up with the supplemental data included in ingestion, Vivian first tested ingestion without supplemental data to see if the data ingestion bursts occurred as desired. As you can see in the following diagram, the 30-minute bursts of data ingestion activity occurred without issue when the supplemental data was turned off, with S3 Standard holding up well:

30-minute bursts of Amazon Data Firehose PUT requests from the solution without supplemental data from Shared Job Systems

With supplemental data turned on, the number of raw files that needed to be processed and loaded increased significantly, and the hourly ingestion period, initially limited to 30 minutes, unexpectedly extended to over an hour. This meant the compute instance’s in-memory queueing system couldn’t fully drain before the next hourly ingestion task started. Tasks built up dramatically, eventually causing Out-of-Memory (OOM) crashes of the Python Celery queue workers within hours of introducing the supplemental data. The immediate impact was that the data the Vivian AI Copilot worked from wasn’t as fresh as it could be, leaving the chatbot unable to make full use of the supplemental data Vivian wanted to ingest.

The same Amazon Data Firehose PUT request metrics for the solution with Shared Job Systems data added to the ingestion workload, showing no pause in ingestion activity and eventually an abrupt silence indicating a system crash

The extended data ingestion period and subsequent issues were caused by a bottleneck in the data ingestion process: the data was split across 400k-600k individual files. For each file, the architecture of the software on the compute instance required additional metadata processing and clerical work before and after the upload. The cumulative impact of this per-file overhead pushed ingestion times past the initial 30-minute window, causing backlogs and system crashes. The sheer number of files, combined with the per-file processing requirements, was the root cause of the performance issues.
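A back-of-the-envelope calculation shows why per-file overhead at this file count breaks the 30-minute window. The file count comes from the post (400k-600k files per hour); the per-file overhead figure below is an illustrative assumption, not a measured value:

```python
# Illustrative sketch of the per-file overhead bottleneck.
FILES_PER_HOUR = 500_000   # midpoint of the 400k-600k range from the post
WINDOW_SECONDS = 30 * 60   # the intended 30-minute ingestion burst

# Time budget per file if ingestion is to finish inside the window:
budget_ms = WINDOW_SECONDS / FILES_PER_HOUR * 1000
print(f"{budget_ms:.1f} ms per file")   # 3.6 ms per file

# With an assumed ~8 ms of metadata processing and clerical work per
# file, the hourly batch overruns the hourly cadence and backlogs grow:
assumed_per_file_ms = 8.0  # assumption for illustration only
total_minutes = FILES_PER_HOUR * assumed_per_file_ms / 1000 / 60
print(f"{total_minutes:.0f} minutes")   # 67 minutes > 60-minute cadence
```

With only a few milliseconds of budget per file, even modest per-file latency dominates total ingestion time, which is why request latency (rather than raw throughput) became the limiting factor.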

Considering solutions to address ingestion challenges

At an impasse, the team had a few paths forward:

  • Vertically scale compute resources to cope with the latency of S3 Standard combined with the per-task and per-file metadata overhead.
  • Build a new system specific to this new project for Shared Job Systems.
  • Refactor the existing software architecture to support higher throughput with the same amount of resources.

The team’s first attempt at remediation was vertical scaling. Ultimately, by the time Vivian got up to an instance that could handle the new workload without OOMs and backlogging, compute costs had increased by 4.5x from the original solution’s costs to a peak of roughly $213 per day in the December 2023 timeframe.

The USD cost of the Amazon EC2 instance used for this single-instance solution; each new color represents an upgrade of the instance size in order to gain more on-machine parallelism using more CPU cores.

During the same timeframe (December 2023), the S3 storage costs were roughly $2.50/day and the request costs were roughly $80/day. This brought the total cost of ownership for the vertically scaled, retrofitted solution to $8.9k/month, significantly higher than the team had planned for, even without factoring in significant potential future growth.
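The ~$8.9k/month figure can be reconstructed directly from the daily costs above; a quick sanity check, assuming a 30-day month:

```python
# Daily cost figures from the post (December 2023).
compute_per_day = 213.0    # vertically scaled EC2 instance
storage_per_day = 2.50     # S3 Standard storage
requests_per_day = 80.0    # S3 Standard request costs

daily_total = compute_per_day + storage_per_day + requests_per_day
monthly_total = daily_total * 30
print(f"${monthly_total:,.0f}/month")  # $8,865/month, the ~$8.9k cited
```

Notably, request costs alone ($80/day) were more than 30x the storage costs, which is what makes a storage class with lower request pricing attractive for this workload.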

Not wanting to slow down the progress of this new initiative by performing a significant refactor of the existing system, or building a new solution from scratch, the team was left to scrutinize existing components and identify opportunities to bring costs down and future-proof the system.

Using Amazon S3 Express One Zone to solve cost and performance challenges

With S3 Express One Zone, Vivian discovered that while the storage costs are slightly higher than S3 Standard, the request costs are up to 50% lower in comparison. Additionally, the accelerated performance (up to 10x faster than S3 Standard) had the potential to deliver faster processing and drive a reduction in total cost, since the same number of CPU cores could achieve higher throughput of Python Celery tasks.

Final data ingestion solution for the Job Data Flow using Amazon S3 Express One Zone for the storage layer

To switch the ingestion component of the solution to S3 Express One Zone, the team created a new S3 directory bucket, upgraded the ingestion server’s boto3 SDK to the latest version (1.34.1), and updated the server’s config file with the new directory bucket. On the consuming ECS task side (Vivian AI Copilot) running NodeJS with the AWS SDK v2 (2.1613.0), Vivian only had to modify the configured bucket name and make a small tweak to the S3 client initialization logic to customize the endpoint used by the client to be S3 Express One Zone.
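The server-side change amounts to pointing the SDK at a directory bucket. As a minimal sketch of what that configuration looks like with boto3 (the bucket and zone names below are hypothetical, not Vivian's actual values; directory bucket names must carry the `--<az-id>--x-s3` suffix S3 Express One Zone requires):

```python
# Sketch of creating an S3 Express One Zone directory bucket with boto3
# (requires boto3 >= 1.34). Names here are illustrative assumptions.

def directory_bucket_params(base_name: str, az_id: str) -> dict:
    """Build create_bucket parameters for an S3 directory bucket."""
    return {
        # Directory bucket names must end in the zone suffix --<az-id>--x-s3.
        "Bucket": f"{base_name}--{az_id}--x-s3",
        "CreateBucketConfiguration": {
            "Location": {"Type": "AvailabilityZone", "Name": az_id},
            "Bucket": {
                "DataRedundancy": "SingleAvailabilityZone",
                "Type": "Directory",
            },
        },
    }

params = directory_bucket_params("vivian-job-ingest", "use1-az4")
print(params["Bucket"])  # vivian-job-ingest--use1-az4--x-s3

# With credentials configured, the bucket would be created with:
# import boto3
# s3 = boto3.client("s3", region_name="us-east-1")
# s3.create_bucket(**params)
# Uploads then use the same PutObject API as S3 Standard:
# s3.put_object(Bucket=params["Bucket"], Key="jobs/12345.json", Body=b"{}")
```

Because directory buckets reuse the standard S3 object APIs, the rest of the ingestion code path needed no changes beyond the bucket name, which is what kept this migration small.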

With S3 Express One Zone, Vivian was able to reduce storage latency by 75% and scale down compute resources to a smaller, cheaper instance type (m6in.4xlarge), bringing the compute costs down to $27/day (from $213/day). As can be seen in the following chart, the request costs were also cut in half (from $80 to $40).

Granular daily solution costs comparing S3 Standard vs. S3 Express One Zone

As a result, the overall solution cost went from $8.9k/month to $2.6k/month, a reduction of approximately 71%.
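The headline reduction follows directly from the rounded monthly totals:

```python
# Rounded monthly totals from the post.
old_monthly = 8_900.0   # S3 Standard, vertically scaled solution
new_monthly = 2_600.0   # S3 Express One Zone solution

reduction = 1 - new_monthly / old_monthly
print(f"{reduction:.0%}")  # 71%
```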

Total cost of ownership per month comparing S3 Standard vs. S3 Express One Zone

As Vivian’s partners ingest more data into Shared Job Systems, the scalability of S3 Express One Zone ensures that the Vivian solution can continue handling step-ups and rapid shifts in data volumes without rearchitecting or scaling up compute infrastructure.

The real-time chat context for Vivian AI Copilot benefits significantly from the low-latency capability of S3 Express One Zone. By providing single-digit millisecond access to the latest job data, S3 Express One Zone enables the AI Copilot to deliver quick and accurate responses to candidates’ queries about job postings, requirements, benefits, and other related information. This enhanced responsiveness and freshness of data is critical for Vivian’s goal of creating a seamless and transparent hiring experience for healthcare professionals.

Importantly, the scalability of S3 Express One Zone also allows Vivian to handle growing data volumes from new Shared Job System integrations without compromising the accuracy or timeliness of the Copilot’s responses. As Vivian expands its partnerships and ingests increasingly large amounts of job data, S3 Express One Zone provides the necessary performance and cost-effective storage to power the Copilot’s capabilities, ensuring a consistently high-quality experience for users. This allows Vivian to focus on innovating the AI assistant’s features rather than worrying about the underlying infrastructure.

Future direction

With the scalable and cost-effective data ingestion solution powered by S3 Express One Zone now in place, Vivian is poised to further expand the reach and capabilities of its Vivian AI Copilot. The team plans to roll out the Copilot to additional customers outside of the current beta testing groups in the coming months, confident in the system’s ability to handle growing data volumes and user demands.

The team at Vivian is excited about additional opportunities ahead. In particular, as Vivian continues to explore and expand AI features by hosting and training its own models, the team anticipates that the low latency of S3 Express One Zone will enable high-performance operation, training, and iteration of those models compared to other available storage backends.

Conclusion

The core purpose of Vivian Health is to consistently enhance the experience for healthcare professionals seeking the best possible job placements in a fast-paced and unpredictable market. By transitioning from traditional S3 to S3 Express One Zone, Vivian gains a significant level of flexibility, enabling rapid feature iterations that bring the team closer to that goal, while optimizing operational and financial costs.

Incorporating Amazon S3 Express One Zone into Vivian’s infrastructure has led to substantial improvements:

  1. Cost reduction: The overall solution cost decreased from $8.9k/month to $2.6k/month, a reduction of approximately 71%.
  2. Performance enhancement: The accelerated performance of S3 Express One Zone (up to 10x faster than S3 Standard) reduced storage latency by 75% and compute resource requirements, bringing down the compute costs from $213/day to $27/day.
  3. Scalability: The scalability of S3 Express One Zone ensures that Vivian can handle step-ups and rapid shifts in data volumes without rearchitecting or scaling up compute infrastructure.
  4. Simplicity: The low-latency capability of S3 Express One Zone allows Vivian to focus on rapid iteration rather than over-complicating existing infrastructure.

The real-time chat context for Vivian AI Copilot benefits significantly from the low-latency capability of S3 Express One Zone. This allows the AI Copilot to provide quick and accurate responses regarding job postings and related queries, thereby enhancing the customer experience.

We’re enthusiastic about continuing to drive innovation in healthcare staffing to improve transparency and streamline the recruitment journey for all clinicians.

Dhaval Shah

Dhaval Shah is a Senior Solutions Architect at AWS, specializing in Machine Learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.

Amee Shah

Amee Shah is a Senior Solutions Architect at AWS, specializing in Data Analytics. She has helped enterprises build scalable, efficient cloud solutions that drive business value. Amee is passionate about using technology to help organizations gain data-driven insights to improve customer experience and identify new business opportunities. Outside of work, she loves to travel and spend time with her family.

Brandon Phillips

Brandon Phillips is a Principal Software Engineer at Vivian Health where he leads the infrastructure team in improving platform stability and internal developer experience. He has 10+ years of experience in cloud-native web application development, encompassing frontend, backend, DevOps, and everything in between. He's particularly passionate about TypeScript, Serverless, and No-Code/Low-Code automation. Away from his desk, Brandon enjoys knitting, musicals, horror movies, and especially hanging out with his Aussiedoodle, Obi.

Eric Stouffer

Eric Stouffer is a Principal Solutions Architect specializing in storage. With over 20 years of experience in enterprise compute and storage, he has helped some of the largest strategic customers at AWS to build compelling and innovative solutions. Eric is particularly interested in high performance use cases such as large-scale ML model training. Eric is also an avid mountain biker, and his three young children keep him busy with various outdoor adventures.

Oleg Chugaev

Oleg Chugaev is a Principal Solutions Architect and Serverless evangelist with 20+ years in IT, holding 9 AWS certifications. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences.

Tycen McCann

Tycen McCann is a Sr. Customer Solutions Manager supporting AWS digital native customers with a focus on the media and entertainment industry. Tycen works closely with customers to realize their strategic initiatives through AWS products and services. He brings AWS best practices and resources to accelerate time to value and drive measurable outcomes. Outside of work, Tycen is an outdoor enthusiast who enjoys exploring nature with his family.