How Smartsheet optimized cost and performance with AWS Graviton and AWS Fargate

This post was co-written by Skylar Graika (Sr. Principal Engineer, Smartsheet).

Introduction

Highly successful companies know that maintaining an accelerated pace of innovation is key to continued growth. They are increasingly looking to modernize their digital backbone of applications and development practices to support faster innovation and improved performance, security, and reliability, while maintaining a focus on cost. Building a modern application means building with the latest technologies and development techniques.

One way in which modern applications are built and deployed is as a collection of modular services running on managed container orchestrators and serverless compute technologies that support agility and reduce operational overhead. Whether running stable, predictable workloads or more volatile and changing workloads, services such as AWS Fargate can meet these demands while enabling faster development.

From a cost lens, modern application services, such as serverless containers, allow organizations to lower their total cost of ownership. AWS Fargate serverless compute allows application teams to run containerized workloads without the operational overhead of infrastructure management, which makes it easy to deploy, manage, and scale containerized applications without the complexity of managing a control plane, nodes, or instances. Modernizing applications with serverless containers improves agility, so teams can innovate faster and save on infrastructure costs by improving utilization and offloading many server management tasks to AWS. Organizations can also draw on AWS best practices and expertise in performance, scalability, and availability to run highly performant and resilient applications.

AWS offers a variety of options to help organizations optimize their cloud spend. Purchase options such as Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances take advantage of unused capacity in the AWS Cloud, allowing users to run containerized applications at an average savings of 65% on infrastructure without impacting the application (see Optimize cost for container workloads with ECS capacity providers and EC2 Spot Instances). AWS Fargate Spot, much like Amazon EC2 Spot, runs tasks on spare capacity in the AWS Cloud, when that capacity is available, at up to a 70% discount off the AWS Fargate price. Tasks are launched based on specified requests, and when AWS needs the capacity back, a notification is sent to the customer. This allows fault-tolerant workloads to run in the cloud at an optimized cost.
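
As a rough illustration of how spare-capacity pricing can be mixed with regular capacity, the sketch below builds the `capacityProviderStrategy` portion of an Amazon ECS `CreateService` request. `FARGATE` and `FARGATE_SPOT` are the capacity provider names that ECS defines; the cluster, service, and task definition names, and the base and weight values, are hypothetical placeholders rather than Smartsheet's configuration.

```python
import json


def build_service_request(cluster, service, task_definition, desired_count,
                          on_demand_base=2, on_demand_weight=1, spot_weight=3):
    """Build keyword arguments for an ECS CreateService call, for example
    boto3's ecs.create_service(**request). Nothing is sent to AWS here, and
    required settings such as networkConfiguration are omitted for brevity."""
    return {
        "cluster": cluster,
        "serviceName": service,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "capacityProviderStrategy": [
            # Keep a small floor of tasks on regular Fargate capacity.
            {"capacityProvider": "FARGATE", "base": on_demand_base,
             "weight": on_demand_weight},
            # Weight the remaining tasks toward interruptible Fargate Spot.
            {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
        ],
    }


# Hypothetical names, used only to show the shape of the request.
request = build_service_request("example-cluster", "example-worker",
                                "example-task:1", desired_count=10)
print(json.dumps(request["capacityProviderStrategy"], indent=2))
```

Because Fargate Spot tasks can be interrupted, a service placed this way should be able to stop on short notice and finish or requeue its in-flight work.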

AWS Graviton processors, available for AWS Fargate tasks and on AWS Graviton-based Amazon EC2 instances, deliver the best price performance for cloud workloads, providing up to 40% better price performance over comparable Amazon EC2 instances. AWS also offers Compute Savings Plans, which can reduce costs by up to 72% in exchange for a usage commitment. This variety of purchase options and compute infrastructure lets customers place each workload on the configuration that best meets its needs for a cost-optimal solution. Many organizations have taken advantage of these options, including AWS customer Smartsheet. Smartsheet has optimized their spend, reinvested savings, and scaled innovation with AWS.

Smartsheet’s journey from optimizing to reinvesting in innovation with AWS Fargate

Smartsheet is a leading Software as a Service (SaaS) platform for enterprise work management that enables teams and organizations of all sizes to plan, capture, manage, automate, and report on work at scale, which results in more efficient processes and better business outcomes. The platform has over 100,000 customers, 13.4 million collaborators, 6 billion active tasks, and approximately 2.4 PB of assets. From serving 75,000 users in 2010 to over 10 million in 2023, Smartsheet’s expansion has been exponential. However, growth comes with its own set of challenges. How can an organization ensure cost optimization, operational agility, and seamless scalability in the face of rapidly evolving demands? This is the question Smartsheet sought to answer through their journey with AWS. Smartsheet has a long-standing partnership with AWS, which began in 2006 with assets hosted in Amazon Simple Storage Service (Amazon S3). Over the years, Smartsheet has co-innovated with AWS by moving to microservices and laying the groundwork for a full migration to AWS. Smartsheet now runs 100% on AWS across multiple Regions.

As Smartsheet adopted AWS Fargate, they were able to use a number of levers to continuously optimize for cost. These levers include rightsizing, elasticity, and Compute Savings Plans, as well as Spot and AWS Graviton. Smartsheet adopted an ‘Optimize, Propel, and Scale (OPS)’ framework across their optimization journey that has proven extremely successful.

“Our engineering team has realized 70% cost reduction with Fargate and Graviton. These savings are significant, especially in the current economic climate where we need to make every dollar stretch further.” – Praerit Garg, CPO & EVP Engineering, Smartsheet.

Optimize

Reducing costs was a vital component as Smartsheet looked at optimizing their services. Optimization is not just about reducing the cloud bill; it is a journey across several aspects. Along with cost reduction, these include running highly efficient operations so that every engineer is able to perform the same tasks, minimizing waste by automating manual operations, and environmental sustainability, which means analyzing the carbon footprint of systems to minimize their impact on the environment. In Smartsheet’s case, moving to AWS Graviton allowed them to use up to 60% less energy than comparable instances.

Smartsheet’s cloud optimization journey

Originally, Smartsheet operated its own data centers. As they moved to AWS, they carved out microservices during the migration. From there, they looked at several levers for optimization that include:

  • Looking at the right instances for the workload. Optimizing the type of instance (whether CPU optimized or memory optimized) maintains high performance while controlling costs.
  • Improved elasticity also comes into play here because you want to take advantage of autoscaling so you’re only paying for instances when needed. Smartsheet gets the majority of traffic during the workweek. Being able to scale back down during the weekend when traffic is low is another way Smartsheet saves on costs.
  • Another vital lever is forecasting, which means understanding and reviewing the metrics and making changes based on the available data. This can include Application Performance Monitoring (APM) and capacity planning by figuring out the baseline of your instances. This also helps you choose the right pricing model, because once you understand your reserved instance baseline you can plan for the on-demand capacity you may need for variable workloads.
  • Deploying containers and using serverless for optimization. AWS Fargate delivers cost savings beyond just the cloud bill by reducing the cognitive load of managing servers.
  • Moving to AWS Graviton is another great step toward cost optimization. Smartsheet initially thought they had hit a roadblock moving their base images from x86 to Arm; however, in retrospect they wish they’d moved sooner. When they decided to make the change, they were able to switch their base images to Arm64 within a month, which saved 20% on costs and reduced their carbon footprint.
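
Two of the levers above, weekend scale-down and the move to Arm64, can be sketched as request payloads. This is a minimal sketch, not Smartsheet's configuration: it builds keyword arguments shaped for the ECS `RegisterTaskDefinition` and Application Auto Scaling `PutScheduledAction` APIs without calling AWS. All names, the image URI, the task sizes, the capacity numbers, and the cron times are assumptions chosen for illustration.

```python
def graviton_task_definition(family, image_uri, cpu="512", memory="1024"):
    """Keyword arguments for ECS RegisterTaskDefinition targeting Arm64 Fargate.
    The image at image_uri must itself be built for linux/arm64. The execution
    role and logging settings are omitted for brevity."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": cpu,
        "memory": memory,
        # This block is what selects Graviton-based capacity on Fargate.
        "runtimePlatform": {
            "cpuArchitecture": "ARM64",
            "operatingSystemFamily": "LINUX",
        },
        "containerDefinitions": [
            {"name": family, "image": image_uri, "essential": True},
        ],
    }


def weekend_scaling_actions(cluster, service, weekday, weekend):
    """Two PutScheduledAction requests: shrink the allowed task range on
    Saturday and restore it on Monday. weekday and weekend are (min, max)
    task-count tuples; schedules are evaluated in UTC by default."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    return [
        {**common, "ScheduledActionName": "weekend-scale-in",
         "Schedule": "cron(0 0 ? * SAT *)",
         "ScalableTargetAction": {"MinCapacity": weekend[0],
                                  "MaxCapacity": weekend[1]}},
        {**common, "ScheduledActionName": "weekday-scale-out",
         "Schedule": "cron(0 0 ? * MON *)",
         "ScalableTargetAction": {"MinCapacity": weekday[0],
                                  "MaxCapacity": weekday[1]}},
    ]


# Hypothetical values, used only to show the shape of each request.
task_def = graviton_task_definition("example-api", "example-registry/example-api:1")
actions = weekend_scaling_actions("example-cluster", "example-api",
                                  weekday=(10, 40), weekend=(2, 8))
```

Scheduled actions like these only adjust the minimum and maximum task counts; a scaling policy driven by observed load would still decide the actual count within that range.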

Propel

Propel is about operational excellence: how to operate a faster, more reliable, more available, more consistent, and more secure platform. Adopting managed services, like AWS Fargate, meant Smartsheet could offload the undifferentiated heavy lifting and focus instead on the core business value they provide to their users. This happens through things like reducing maintenance, increasing automation, and enhancing security (for example, taking advantage of the compliance and security certifications of AWS managed services). Smartsheet went from weekly deployments to multiple deployments per day.

By using AWS CodePipeline with AWS Fargate, Smartsheet deployed independent and automated workflows, which simplified their infrastructure and eased the transition away from relying heavily on dedicated Site Reliability Engineers (SREs). This means Smartsheet can respond faster and more effectively to challenges, with each engineer having more time and ability to address a wider range of tasks.

Automated cell-based deployments are yet another way to drive efficiency. As engineers check in code, GitLab Continuous Integration (CI) runs the build process and triggers the pipeline. Smartsheet builds the Docker images and pushes them to Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry that simplifies the storage, sharing, and deployment of container images. Once an image is pushed, AWS CodePipeline’s native integration with Amazon ECR triggers the pipeline, AWS CodeBuild creates a build artifact, and that artifact is passed to AWS CodeDeploy, which deploys it to AWS Fargate. Smartsheet uses a variety of managed services that have existing integrations with AWS Fargate.
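
To make the hand-off between build and deploy more concrete, here is a small sketch of two pieces a pipeline like this typically needs: the Amazon ECR image URI that the CI job pushes to, and an AppSpec document that tells AWS CodeDeploy which container and port of the ECS service receive traffic. The post does not say which artifact files Smartsheet produces, so the AppSpec content, the account ID, Region, repository, tag, container name, and port below are all assumptions for illustration.

```python
import json


def ecr_image_uri(account_id, region, repository, tag):
    """Amazon ECR image URIs follow this registry/repository:tag layout."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def ecs_appspec(container_name, container_port):
    """AppSpec content for a CodeDeploy deployment to Amazon ECS, as a dict.
    <TASK_DEFINITION> is a placeholder that is filled in at deploy time."""
    return {
        "version": 0.0,
        "Resources": [{
            "TargetService": {
                "Type": "AWS::ECS::Service",
                "Properties": {
                    "TaskDefinition": "<TASK_DEFINITION>",
                    "LoadBalancerInfo": {
                        "ContainerName": container_name,
                        "ContainerPort": container_port,
                    },
                },
            }
        }],
    }


# Hypothetical values: a documentation-style account ID and a Git commit tag.
image = ecr_image_uri("123456789012", "us-west-2", "example-service", "abc1234")
appspec = ecs_appspec("example-service", 8080)
print(image)
print(json.dumps(appspec, indent=2))
```

Tagging images with the commit identifier, as assumed here, keeps each deployment traceable to the code change that produced it.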

Automated Cell-Based Deployments

Scale

With a cell-based architecture, Smartsheet can set a limit on how large a single cell can be and test that limit to failure. When more resources are required, another cell is stamped out. Within each cell, they run the AWS Fargate service, which is the operational heart of their shard, and use Amazon DynamoDB or Amazon Aurora MySQL as the data store, depending on the use case. Finally, to separate synchronous from asynchronous jobs, Smartsheet used Amazon Simple Queue Service (Amazon SQS) for job-based processing. They split off work from a request coming into the system and queued it, with another AWS Fargate task processing it from the job queue. To route traffic to cells, they built a router that maps each request to the correct cell.
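
The routing and queueing pattern described above can be sketched in a few lines. This is an in-memory model under stated assumptions, not Smartsheet's router: a plain dictionary stands in for the routing table, a `deque` stands in for each cell's Amazon SQS queue, and the tenant ID as routing key, along with every class and function name, is hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One isolated cell; the deque stands in for the cell's job queue."""
    name: str
    jobs: deque = field(default_factory=deque)


class CellRouter:
    """Maps a routing key (assumed here to be a tenant ID) to a cell."""

    def __init__(self):
        self._cells = {}
        self._assignments = {}

    def add_cell(self, cell):
        self._cells[cell.name] = cell

    def assign(self, tenant_id, cell_name):
        if cell_name not in self._cells:
            raise KeyError(f"unknown cell: {cell_name}")
        self._assignments[tenant_id] = cell_name

    def route(self, tenant_id):
        return self._cells[self._assignments[tenant_id]]


def handle_request(router, tenant_id, payload):
    """Synchronous path: pick the cell, enqueue the slow work, return at once."""
    cell = router.route(tenant_id)
    cell.jobs.append({"tenant": tenant_id, "payload": payload})
    return cell.name


def drain(cell):
    """Asynchronous path: a worker task pulls jobs off the cell's queue."""
    processed = []
    while cell.jobs:
        processed.append(cell.jobs.popleft())
    return processed


router = CellRouter()
router.add_cell(Cell("cell-1"))
router.add_cell(Cell("cell-2"))
router.assign("tenant-a", "cell-1")
router.assign("tenant-b", "cell-2")

routed_to = handle_request(router, "tenant-a", "recalculate sheet")
done = drain(router.route("tenant-a"))
```

Because each tenant's work lands only in its own cell's queue, a backlog or failure in one cell does not consume worker capacity in another.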

“Cell-based architecture enhances testability. Unlike ever-expanding monoliths, it allows setting and testing cell limits, ensuring predictable performance. For more capacity, simply add another cell.” – Ray Cole, Distinguished Engineer, Smartsheet.

Smartsheet’s cell-based architecture

Smartsheet anticipated future demands and scaled their infrastructure accordingly. By forecasting their needs, they ensured that they were not only reacting to present challenges but were well-prepared for the future, which ensured their system was always ahead of the curve and primed for growth. Through load tests, they simulated high-traffic scenarios, ensuring that their system was not only capable of handling current loads but was also ready for unexpected surges. This approach ensured resilience, stability, and a consistent user experience, even under extreme conditions.

As they scaled, they recognized the importance of resilience. Practices, like isolation, ensured that failures remained localized, preventing system-wide outages. Integrating kill-switches allowed them to quickly disable malfunctioning components, and circuit breakers added an additional layer of protection against cascading failures. These strategies weren’t just about building a bigger system; they were about building a smarter, more resilient one.
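
For readers less familiar with these two patterns, the following is a minimal, single-threaded sketch of a circuit breaker and a kill switch. It illustrates the general techniques only; the class names, thresholds, and timings are assumptions, and nothing here is taken from Smartsheet's implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Stops calling a dependency after repeated failures, then allows a
    trial call once reset_after seconds have passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open, the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


class KillSwitch:
    """Lets operators turn a named component off without a deployment."""

    def __init__(self):
        self._disabled = set()

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def is_enabled(self, name):
        return name not in self._disabled


switches = KillSwitch()
switches.disable("bulk-export")  # hypothetical component name
```

The breaker reacts automatically to failures it observes, while the kill switch is a manual control; the two complement each other rather than overlap.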

Advantages of a cell-based architecture for Smartsheet

Higher reliability

Having multiple cells reduces the scope of impact, because a cell is the core unit that provides containment for many failure scenarios. When properly isolated from each other, cells have failure containment similar to what we see with Regions. It’s highly unlikely for a service outage to span multiple Regions. It should be similarly unlikely for a service outage to span multiple cells.

Cell-based architectures reduce the scope of impact

Higher scalability

It’s important to define, test, and manage the limits and capacity of a cell. By knowing and monitoring this capacity, Smartsheet defined limits and scaled the workload by adding new cells to the architecture, thus scaling it out.
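
The scale-out rule can be sketched as a simple placement function: fill existing cells up to a tested limit, and stamp out a new cell when none has room. Measuring capacity as a tenant count is a simplifying assumption made for this sketch; a real limit would more likely be expressed in terms of load, and the names below are hypothetical.

```python
class CellFleet:
    """Places tenants into cells, adding a cell whenever all are at the limit."""

    def __init__(self, max_tenants_per_cell):
        if max_tenants_per_cell < 1:
            raise ValueError("a cell must hold at least one tenant")
        self.max_tenants_per_cell = max_tenants_per_cell
        self.cells = []  # each cell is a list of tenant IDs

    def place(self, tenant_id):
        """Return the index of the cell that receives this tenant."""
        for index, tenants in enumerate(self.cells):
            if len(tenants) < self.max_tenants_per_cell:
                tenants.append(tenant_id)
                return index
        # Every existing cell is at its tested limit: stamp out a new one.
        self.cells.append([tenant_id])
        return len(self.cells) - 1


fleet = CellFleet(max_tenants_per_cell=2)
placements = [fleet.place(f"tenant-{n}") for n in range(5)]
```

No cell ever grows past the limit that was tested, so the behavior validated for one cell carries over to every cell added later.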

Scaling out by adding new cells

Higher testability

Testing in distributed systems is a complex undertaking, and this challenge magnifies with the scale of the system. Smartsheet places a premium on the reliability and robustness of its services. This is where the beauty of cell-based architecture comes into play, particularly when focusing on the testability aspect.

One of the key strengths of a cell-based architecture is the capped size of individual cells, which gives a well-defined boundary for scaling. Given that their platform supports a myriad of use cases (from project planning and resource allocation to reporting and automation), the predictable scaling behavior of individual cells allows their engineering team to perform tests with a high degree of accuracy. Smartsheet simulated the maximum workload that a cell needs to handle to guarantee performance and reliability under the most demanding conditions.

The limited size of cells also offers cost-effective testability. It is financially and logistically feasible to simulate the full range of operations within a single cell. Smartsheet’s platform often deals with large, enterprise-level workloads. Due to the cell-based structure, Smartsheet doesn’t have to simulate the complete workload of all tenants to validate performance. They only need to model the maximum workload that can fit into a single cell.

Therefore, cell-based architecture not only meets but elevates Smartsheet’s demanding requirements for testability. It allows them to simulate real-world scenarios in a controlled and bounded environment, ensuring that their services remain reliable, scalable, and above all, rigorously tested. This adherence to stringent testing protocols is part of Smartsheet’s commitment to delivering a high-quality, dependable platform for its users.

Conclusion

In this post, we showed you the key milestones in Smartsheet’s multi-year journey of powering their business with AWS. From implementing microservices and cell-based architectures with Amazon ECS and AWS Fargate, to migrating to AWS Graviton for improved performance, to optimizing costs with Compute Savings Plans and Spot, Smartsheet has used many of the levers available to control costs, maximize performance, and ultimately delight their customers. Check out the “Containers from the Couch” episode on how Smartsheet is using Amazon ECS and AWS Fargate for their workloads.

AWS Fargate provides the capability to deploy containers with a serverless operating model that emphasizes security and productivity by shifting infrastructure management of Amazon Machine Images (AMIs) and capacity to AWS. Explore how serverless containers can help power your workload today with these additional resources: Serverless Land, Containers on AWS, AWS Fargate Documents, and AWS Graviton Documents.

If you’re interested in the concepts introduced in this post, then please feel free to reach out using social media (Ravi Yadav, Steven Follis, Utsav Shah, and Skylar Graika).

Skylar Graika, Smartsheet

Skylar, a tech industry luminary and co-founder of Footmarks, has notably elevated IoT platform technology for numerous Fortune 500 companies. His expertise as a full-stack software engineer and an adept leader is reflected in his ability to transform complex ideas into tangible, high-impact products. Known for his strategic approach and innovative problem-solving skills, Skylar excels in orchestrating the development and deployment of cutting-edge technological solutions.