How Toyota Connected North America reduced storage costs by optimizing its growing data on Amazon S3

Toyota Connected North America, founded in 2016, focuses on developing and delivering advanced technology and data services for Toyota and Lexus vehicles. Toyota Connected’s mission is to make mobility a more accessible, exciting, and human-centric experience for everyone. To this end, Toyota Connected uses data connectivity to serve more than 8 million retail customers, hundreds of dealers, and fleet companies across North America.

Toyota Connected’s real-time data platform has grown exponentially since its launch about half a decade ago, as the number of connected vehicles on the road has dramatically increased. The requirements of the big data analytics platform are uncertain: they depend on market conditions and need to evolve quickly to support the company’s business initiatives, such as increasing electrification in the product cycle. As a result, data storage costs for these platforms are a major contributor to Toyota Connected’s overall budget.

In this post, we discuss how Toyota Connected North America used managed services and features, such as the Amazon S3 Intelligent-Tiering storage class, Amazon S3 Lifecycle, and Amazon Athena, to achieve more than 40% savings per month while data volume grew 4% month over month to support the business.

Data analysis and storage challenges

At a high level, the Toyota Connected real-time data platform consists of multiple layers:

The Ingestion Layer receives data from the hundreds of sensors in vehicles.
The Decoding Layer converts binary data into a usable format.
The Transform Layer transforms the data and publishes the enhanced data to downstream customers through numerous data services, such as telemetry, driver score, collision notification, and safety services.
The Analyze/Consume Layer has a big data analytics platform, which processes vehicle data and applies machine learning/artificial intelligence (ML/AI) to provide insights for research and development, improve the quality of Toyota vehicles, and improve customer satisfaction.

An architecture diagram of the multi-tiered Toyota Connected real-time data platform.

To address the rising costs associated with the growing number of vehicles, the team initially implemented retention policies to move the data from the S3 Standard storage class to the S3 Glacier Flexible Retrieval storage class using S3 Lifecycle. Although this approach provided the much-needed savings to reduce the overall cloud cost, it was necessary to kaizen these savings even further.

The first challenge was that retention policies were static; most of them used simple rules, such as moving data from one storage tier to another after 30 days. The second challenge was the uncertainty of the data requirements of the big data analytics platform, as they were based on market conditions. For example, data from the Prime plug-in hybrid or the bZ4X electric vehicle should be more quickly retrievable because it is timely for growing Toyota initiatives, such as electrification.
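To make the idea of a static retention policy concrete, here is a minimal Python sketch of an S3 Lifecycle configuration with a single age-based transition. The prefix, rule ID format, and the `apply_rule` helper are hypothetical and are not the actual Toyota Connected configuration; `apply_rule` assumes boto3 is installed and AWS credentials are configured.

```python
import json


def static_transition_rule(prefix: str, days: int = 30) -> dict:
    """Build a Lifecycle configuration with one fixed, age-based transition."""
    rule_id = f"archive-{prefix.strip('/').replace('/', '-')}-after-{days}d"
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # "GLACIER" is the API name for S3 Glacier Flexible Retrieval.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_rule(bucket: str, config: dict) -> None:
    """Apply the configuration (hypothetical helper, not called here)."""
    import boto3  # imported lazily so the sketch runs without boto3

    # Note: this call replaces the bucket's entire Lifecycle configuration.
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = static_transition_rule("telemetry/", days=30)  # hypothetical prefix
print(json.dumps(config, indent=2))
```

The rule fires purely on object age, which is exactly the limitation described above: it has no knowledge of whether an object is still being read.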

What was missing was the intelligence to determine which data was frequently accessed and which was infrequently accessed. As these access patterns changed, Toyota Connected needed a way to dynamically make data available as needed. Furthermore, if infrequently accessed data was moved to a low-cost storage class, such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, it still needed to be instantly accessible to the analytics workloads as business needs changed. For example, for Earth Day 2023, Toyota Connected needed to quickly pull the CAN data from vehicles driving in Normal and Eco Mode to compare the carbon footprint of Toyota-owned vehicles during a specific period. The team looked for a lean solution to these issues that would not add operational overhead or require a complete redesign of its existing workloads and applications.

Solution

Toyota and AWS formed a focus group to analyze different datasets, their access patterns, data volume, and object count. The outcome of this analysis was a list of categorized datasets that could benefit from specific storage approaches, along with the potential cost savings associated with those approaches. S3 Storage Lens and S3 Storage Class Analysis proved helpful. S3 Storage Lens gave us a high-level summary of object storage usage and activity, such as total storage, object counts, and average object size per storage class or S3 bucket. S3 Storage Class Analysis helped us identify how much of our storage was infrequently accessed.

We followed suggestions from AWS for the buckets that would benefit most from optimizations. In general, these buckets were the ones that stored more data than others, had large average object sizes, and had long-lived objects that were not transient. We worked with the AWS team to estimate precisely how much cost savings we could expect for each bucket we analyzed. After that, we formulated a three-pronged strategy.
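The selection criteria above (more data, large average object size, long-lived objects) can be sketched as a simple filter and sort. This is only an illustration: the `BucketStats` type, the bucket names, the sample numbers, and both thresholds are assumptions, and in practice the inputs would come from an export of storage metrics rather than hard-coded values.

```python
from dataclasses import dataclass

TB = 1024 ** 4


@dataclass
class BucketStats:
    """Hypothetical per-bucket summary, e.g. derived from a metrics export."""
    name: str
    total_bytes: int
    object_count: int
    median_object_age_days: int

    @property
    def avg_object_bytes(self) -> float:
        return self.total_bytes / self.object_count if self.object_count else 0.0


def shortlist(buckets, min_avg_bytes=128 * 1024, min_age_days=30):
    """Keep buckets with large, long-lived objects; biggest buckets first."""
    candidates = [
        b for b in buckets
        if b.avg_object_bytes >= min_avg_bytes
        and b.median_object_age_days >= min_age_days
    ]
    return sorted(candidates, key=lambda b: b.total_bytes, reverse=True)


# Made-up sample data for illustration only.
sample = [
    BucketStats("decoded-telemetry", 120 * TB, 40_000_000, 200),
    BucketStats("tmp-staging", 80 * TB, 10_000_000, 3),          # transient
    BucketStats("raw-can-data", 500 * TB, 200_000_000, 400),
    BucketStats("small-event-logs", 5 * TB, 200_000_000, 365),   # tiny objects
]

for b in shortlist(sample):
    print(b.name, round(b.avg_object_bytes / 1024 ** 2, 2), "MiB avg")
```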

S3 Lifecycle configuration

In our analysis, we found it important to distinguish scenarios where a small number of objects were accessed many times from scenarios where a large number of objects were accessed only once, as happens when access patterns are predictable. In the latter case, S3 Intelligent-Tiering would save less than an S3 Lifecycle configuration. This is because S3 Intelligent-Tiering has a small monthly monitoring and automation charge for tracking the access patterns of objects, and it moves objects to the Infrequent Access tier only after they have not been accessed for 30 consecutive days. For our data with predictable access patterns, we kept S3 Lifecycle configurations on these buckets so that we could apply more customized rules.
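The trade-off can be illustrated with a small back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions: all prices are placeholder inputs rather than quoted AWS prices, both approaches are assumed to end up with the same hot/cold split at the same per-GB prices, and request, transition, and retrieval charges are ignored. Under those assumptions, the only difference is the per-object monitoring charge.

```python
def monthly_costs(
    object_count: int,
    avg_object_gb: float,
    cold_fraction: float,
    hot_price_gb: float,
    cold_price_gb: float,
    monitoring_per_1000_objects: float,
) -> dict:
    """Compare monthly cost of Lifecycle rules vs. Intelligent-Tiering.

    All prices are caller-supplied placeholders, not real AWS prices.
    """
    total_gb = object_count * avg_object_gb
    storage = total_gb * (
        (1 - cold_fraction) * hot_price_gb + cold_fraction * cold_price_gb
    )
    # Intelligent-Tiering adds a charge that scales with the number of
    # monitored objects, independent of how many bytes they hold.
    monitoring = object_count / 1000 * monitoring_per_1000_objects
    return {"lifecycle": storage, "intelligent_tiering": storage + monitoring}


# Hypothetical workload: many small objects with a predictable pattern.
costs = monthly_costs(
    object_count=10_000_000,
    avg_object_gb=0.001,
    cold_fraction=0.8,
    hot_price_gb=0.02,                    # placeholder price
    cold_price_gb=0.01,                   # placeholder price
    monitoring_per_1000_objects=0.0025,   # placeholder price
)
print(costs)
```

Because the monitoring charge grows with object count rather than data size, it matters most for buckets holding many small objects whose access pattern is already known.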

S3 Intelligent-Tiering

Next, for the buckets with unpredictable access patterns, we wanted to gauge the impact of using S3 Intelligent-Tiering. The team performed extensive analysis that included verifying downstream applications, such as Athena and Amazon EMR, to make sure that there was no latency issue in retrieving data from the instant-access tiers. In addition, the daily jobs that used these AWS services were monitored to make sure that there was no adverse impact on operational efficiency.

After moving the appropriate data with unpredictable access patterns to S3 Intelligent-Tiering, the team checked the cost through AWS Cost Explorer. The first change, noted in February 2023, was a modest spike in costs for about a day due to the price of initially transitioning objects to the new storage class in Amazon S3. After that spike passed, costs returned to normal as expected. Around 30 days after the switch, we began seeing 40% savings per month with a 4% month-over-month organic growth in data volume. This is because S3 Intelligent-Tiering automatically moved infrequently accessed objects to the lower-cost storage tiers.
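One common way to move existing objects into S3 Intelligent-Tiering is a Lifecycle rule with a zero-day transition; the post does not state which mechanism the team used, so treat the following Python sketch as an assumption. It also estimates the one-time transition charge behind a short-lived spike like the one described above. The prefix, rule ID, object count, and per-request price are all hypothetical placeholders.

```python
def intelligent_tiering_rule(prefix: str) -> dict:
    """Lifecycle configuration that transitions objects as soon as eligible."""
    return {
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",  # hypothetical rule ID
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


def one_time_transition_cost(
    object_count: int, price_per_1000_requests: float
) -> float:
    """Each transitioned object is billed as one Lifecycle transition request."""
    return object_count / 1000 * price_per_1000_requests


rule = intelligent_tiering_rule("analytics/")  # hypothetical prefix
spike = one_time_transition_cost(
    object_count=50_000_000,        # made-up object count
    price_per_1000_requests=0.01,   # placeholder price, not a quoted rate
)
print(rule["Rules"][0]["Transitions"], spike)
```

Since the charge is per object and is paid once, it shows up as a brief bump rather than a lasting change in the monthly bill.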

A bar graph with a line that compares the actual costs using S3 Intelligent-Tiering, projected costs without S3 Intelligent-Tiering, and data volume.
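The relationship between the two cost series in the graph can be approximated with simple arithmetic. The sketch below assumes that the 40% savings is measured against the projected cost without S3 Intelligent-Tiering, that cost without it scales with the 4% monthly data growth, and that the savings apply from the first month shown. The baseline of 1,000 cost units is a made-up figure, not an actual Toyota Connected cost.

```python
def project_costs(
    baseline_cost: float,
    months: int,
    growth: float = 0.04,
    savings: float = 0.40,
) -> list:
    """Return (month, cost_without_it, cost_with_it) tuples.

    cost_without_it compounds with data growth; cost_with_it applies a
    flat savings rate to that projection.
    """
    rows = []
    for month in range(months):
        without_it = baseline_cost * (1 + growth) ** month
        with_it = without_it * (1 - savings)
        rows.append((month, round(without_it, 2), round(with_it, 2)))
    return rows


for month, without_it, with_it in project_costs(1000.0, months=6):
    print(f"month {month}: without={without_it:>8.2f}  with={with_it:>8.2f}")
```

Under these assumptions the absolute savings also grow each month, because the same percentage applies to a larger projected bill.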

Duplicate data avoidance

For data analysis and query processing on the big data analytics platform, Toyota Connected used Athena to query data in Amazon S3. Our datasets contained files in both S3 Standard and S3 Glacier storage classes. At the time, Athena could not directly query restored data from S3 Glacier archival storage classes, such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. However, we occasionally needed access to the archived data to meet ad hoc business demands or troubleshoot technical issues.

To circumvent this issue and ensure seamless execution of queries, we implemented a workaround. This involved duplicating the restored objects, thereby creating additional copies of these restored objects in S3 Standard storage class so that Athena could query them. However, this approach introduced an unintended consequence in the form of increased costs due to the multiplication of data storage. We shared product feedback with the AWS team, as this overhead cost became a point of concern for our organization in leveraging Athena for our analytical needs.

On June 29, 2023, Athena released a new feature to directly query restored data in the S3 Glacier storage classes without needing additional copies, thereby streamlining the process and significantly reducing costs. This enhancement marked a pivotal moment in optimizing our data analysis workflows that rely on Athena. It allowed us to harness the full potential of our data while keeping expenses in check. This blog post shares further details of this new feature.
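As a minimal sketch of how this feature is typically enabled, the Athena documentation describes a `read_restored_glacier_objects` table property that is set per table. The Python below builds that statement and, optionally, submits it through boto3. The table name, database, and output location are hypothetical; `run_ddl` assumes boto3 is installed and credentials are configured, and the archived objects still have to be restored in Amazon S3 before Athena can read them.

```python
import re

# Accepts "table" or "database.table" made of simple identifiers only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def enable_restored_glacier_reads(table: str) -> str:
    """Build the DDL that lets Athena read restored S3 Glacier objects."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('read_restored_glacier_objects' = 'true')"
    )


def run_ddl(sql: str, database: str, output_location: str) -> str:
    """Submit the statement to Athena (hypothetical helper, not called here)."""
    import boto3  # imported lazily so the sketch runs without boto3

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


ddl = enable_restored_glacier_reads("vehicle_can_data")  # hypothetical table
print(ddl)
```

With the property set on the table, queries can read restored objects in place, so the extra S3 Standard copies described earlier are no longer needed.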

Conclusion

In this post, we discussed a three-pronged strategy to optimize storage cost. Implementing S3 Intelligent-Tiering allowed us to make sure our resources were being allocated efficiently, optimizing costs without any drop-off in performance. To make the most of S3 Intelligent-Tiering, it is essential to identify the buckets that contain data with varying access frequencies, as we did to reach the 40% monthly savings. On the other hand, the S3 Lifecycle configuration helps us apply more customized rules to our buckets that have data with predictable access patterns. Additionally, using Amazon Athena to directly query restored data from S3 Glacier archive classes removes operational overhead and further saves on storage costs.

These optimization initiatives helped drive sustainable business growth for Toyota Connected’s real-time data platform, which has experienced exponential growth due to an ever-increasing number of connected vehicles. They also supported new strategic business plans, such as electrification and Earth Day external engagement initiatives, which had and continue to have uncertain data requirements.

Thank you for reading this blog. If you have any feedback for this post, leave your comments in the comments section.

AWS Storage Blog