Containers

Mobileye: Revolutionizing HD map creation for autonomous vehicles with Spark on Amazon EKS

Mobileye (Nasdaq: MBLY), a global leader in advanced driver-assistance systems (ADAS), is at the forefront of the autonomous driving revolution. Founded in 1999, the company has pioneered groundbreaking technologies such as REM™ crowdsourced mapping, True-Redundancy™ sensing, and Responsibility-Sensitive Safety (RSS). These innovations are paving the way for a future of self-driving vehicles and advanced mobility solutions. Over 125 million vehicles worldwide are already equipped with Mobileye technology.

In 2022, Mobileye became an independent company, separate from Intel, which retained majority ownership. As the company continued to shape the future of autonomous driving, it faced the challenge of creating affordable, feature-rich high-definition (HD) maps. The Mobileye Road Experience Management (REM) group, responsible for HD map creation, rose to the challenge by building a robust microservices ecosystem on Amazon Web Services (AWS). A key component of this solution was the ability to effectively scale Apache Spark workloads on Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

Scaling challenges

This post explores how Mobileye tackled scaling challenges and significantly accelerated their HD map creation process. Their strategic decision to use Kubernetes for managing data workloads on their massive Amazon EKS setup, with 3,000 nodes, 170,000 worker cores, and 30,000 pods in a single EKS cluster (and up to 450K vCPUs across multiple clusters), proved instrumental. To address specific challenges encountered with running Spark workloads on Amazon EKS, Mobileye partnered with the Data on EKS (DoEKS) team to scale these workloads effectively.

Four key reasons why Mobileye chose Kubernetes for data workloads:

  • Scalability: Kubernetes excels at scaling resources efficiently, employing tools such as Karpenter and cluster autoscalers. This functionality allows Mobileye to dynamically adapt resources in line with the fluctuating needs of their projects.
  • Orchestration: Kubernetes automates crucial processes such as job resubmission after failures and precise resource allocation at the job level. This makes sure that operations are smooth and resources are used optimally.
  • Dynamic configuration: Kubernetes facilitates the management of various settings on a per-job basis, such as environment variables, resource requirements, and container images. This flexibility is crucial for making swift and seamless adjustments to meet evolving project requirements.
  • Standardization: Kubernetes allows Mobileye to use the extensive ecosystem of open-source and commercial tools and services designed to seamlessly integrate with Kubernetes workloads. By adopting Kubernetes as a standardized platform, Mobileye can incorporate tools such as the Spark Operator and other open-source solutions that are pre-built to support Kubernetes environments.

Our journey with Amazon EKS

Our strategic decision to embrace Kubernetes led us to Amazon EKS as our preferred solution on AWS. Migrating our workloads to Amazon EKS provided the flexibility and isolation our applications demanded. We needed the ability to configure unique environments without the hassle of redeploying infrastructure. With Amazon EKS, we can run each Spark application within its own isolated environment, significantly streamlining our debugging processes. Amazon EKS decreases the operational burden of managing complex Kubernetes deployments. This AWS managed service has made Kubernetes installation and operation effortless for our DevOps teams. By using Amazon EKS, Mobileye has achieved the agility and efficiency necessary to build and run our machine learning (ML) workloads on Apache Spark.

Mobileye REM cloud-spark architecture

In our quest to develop a robust platform for both users and systems to effortlessly create and manage Spark applications, we set ambitious targets for the internal Mobileye REM cloud Spark architecture, and decided on the Spark Operator as a pivotal element in this strategy.

Our system requirements were clear and demanding:

  • Uncompromising availability: Achieve 99.95% uptime to make sure of continuous operation.
  • Resilient design: Support usage across multiple EKS clusters for enhanced resilience and flexibility, and enable operation across multiple AWS Regions to optimize performance and reliability.
  • Lightning speed and scale: Deliver exceptional speed and scalability, scaling up to 450K vCPUs and up to 30K pods running on up to 3,000 nodes per cluster.

The path to migration

To meet these goals, we developed an internal API responsible for handling requests to create Spark applications, tasked with initiating these applications on the designated cluster. However, the challenge of actually launching the Spark application remained, a task notorious for its configuration complexity.
We decided to use the open-source Spark Operator. This tool provides us with a direct API to create Spark applications, significantly simplifying the process. Combined with EKS Blueprints, which demonstrate best practices for deploying the Spark Operator on Amazon EKS, and the expertise of the DoEKS team, who helped us tailor the blueprints to our specific use case, we were able to streamline the deployment and make sure of optimal performance for our Spark workloads on the Kubernetes platform.
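
For illustration, here is a minimal SparkApplication manifest of the kind the operator accepts. This is a hedged sketch; the image, file path, and sizing are hypothetical, not Mobileye's actual configuration:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sample-etl-app              # hypothetical name
  namespace: spark-jobs
spec:
  type: Python
  mode: cluster
  image: example.registry/spark:3.5.0                      # hypothetical image
  mainApplicationFile: s3a://example-bucket/jobs/etl.py    # hypothetical path
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 2
  driver:
    cores: 1
    memory: 4g
    serviceAccount: spark
  executor:
    instances: 10
    cores: 4
    memory: 8g

Submitting this manifest (through kubectl or, in Mobileye's case, an internal API) replaces a hand-crafted spark-submit invocation; the operator translates it into driver and executor pods and manages their lifecycle.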

As we progressed with transitioning REM systems to this new platform, Mobileye prepared for the migration with optimism. However, we soon encountered various challenges related to Amazon EKS operation and the scaling of other open-source software (OSS) components. These hurdles underscored the complexity of migrating to a high-scale, sophisticated cloud environment.

Networking with VPC CNI

Amazon EKS, through the Amazon VPC CNI plugin, simplifies network setup by pre-allocating a block of IP addresses to each instance. Although this is convenient for smaller deployments, this approach can lead to challenges when scaling up.

The number of IPs pre-allocated is tied to instance size. In large clusters, these blocks can consume your subnet’s IP space rapidly, even if the actual pod density per node is relatively low. This ultimately limits the total number of nodes you can deploy within a subnet.

To address this challenge, Mobileye REM Cloud carefully tuned VPC CNI settings to match our workload requirements (~10 pods + daemon sets per instance):

WARM_PREFIX_TARGET = "0"
WARM_ENI_TARGET    = "0"
MINIMUM_IP_TARGET  = "10"
WARM_IP_TARGET     = "5"
  • Minimize upfront IP allocation: By setting both WARM_PREFIX_TARGET and WARM_ENI_TARGET to 0, Mobileye reduced the initial number of IPs assigned per instance, preventing unnecessary allocation.
  • Control minimum and warm IP targets: MINIMUM_IP_TARGET defines the minimum number of IPs guaranteed for each instance, while WARM_IP_TARGET specifies the buffer of readily available IPs. These settings allowed Mobileye to fine-tune IP allocation based on their specific needs.
  • CIDR allocation: Two /16 Secondary CIDRs per cluster, providing a total of approximately 128k IP addresses.
  • Subnet distribution: Each /16 CIDR is subdivided into four /18 subnets (16k IP addresses each) for optimal distribution across Availability Zones (AZs).
  • AZ balancing: Three AZs are used:
    • Two AZs with three /18 subnets each (48k IP addresses/AZ)
    • One AZ with two /18 subnets (32k IP addresses)
  • Cluster subnets: Each EKS cluster uses three distinct subnets for balanced resource allocation.

These settings significantly reduce IP usage. For very large EKS clusters, consider employing larger subnets and spanning multiple AZs for maximum capacity and resilience. Proactively plan your IP space by using routable subnets with smaller CIDR ranges for load balancers and public interfaces, and use non-routable secondary CIDRs with larger ranges to accommodate Spark pods.
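
For reference, these variables are applied as environment variables on the VPC CNI's aws-node DaemonSet in kube-system. A minimal fragment showing where they land (the full DaemonSet is managed by the VPC CNI add-on):

# Fragment of the aws-node DaemonSet (kube-system)
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: WARM_PREFIX_TARGET   # no warm prefixes pre-allocated
              value: "0"
            - name: WARM_ENI_TARGET      # no warm ENIs pre-allocated
              value: "0"
            - name: MINIMUM_IP_TARGET    # guarantee 10 IPs per node
              value: "10"
            - name: WARM_IP_TARGET       # keep a buffer of 5 ready IPs
              value: "5"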

CoreDNS autoscaling and NodeLocal DNSCache to the rescue

As soon as the IP address problem was resolved, we began to encounter DNS resolution errors. We saw multiple unknown-host failures, and many components on the cluster misbehaved because internal name resolution was affected as well. We had anticipated DNS load and scaled CoreDNS up front, so these failures caught us by surprise.

The default configuration for CoreDNS in Amazon EKS is two pods per cluster, but when you plan to run thousands of nodes and ten times as many pods, two pods cannot handle the load for all of them. This issue can be avoided by horizontally scaling CoreDNS pods with the Cluster Proportional Autoscaler, which scales the number of CoreDNS pods with the number of worker nodes in your cluster.
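
The Cluster Proportional Autoscaler reads its scaling policy from a ConfigMap. A minimal sketch of a linear policy (the ratios below are illustrative defaults, not Mobileye's production values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  # One CoreDNS replica per 16 nodes or per 256 cores, whichever
  # yields more replicas; never fewer than two.
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true
    }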

Yet even after scaling CoreDNS out, the resolution issues persisted, and they grew worse as we scaled further!

At that point, we came to realize that CoreDNS has a scaling issue: as you scale further, one of the most important features of DNS, local caching, does not scale with you.
Each CoreDNS pod maintains its own individual cache, and since our requests predominantly target a specific set of addresses, caching should be highly advantageous. However, as the requests get distributed across multiple pods, we fail to use the cache effectively. Consequently, we end up overwhelming the cluster network with redundant requests, resulting in unnecessary overhead. Furthermore, Kubernetes' ndots configuration compounds the issue by generating significant noise when resolving non-Kubernetes addresses (additional details can be found in this Kubernetes pods post).
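
One common mitigation for the ndots noise (a general Kubernetes technique, not necessarily what Mobileye deployed) is lowering ndots in the pod spec so that external names bypass the search-domain expansion. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: ndots-example              # hypothetical pod
spec:
  dnsConfig:
    options:
      # Default is ndots:5; with ndots:2, a name with two or more dots
      # (such as s3.amazonaws.com) is tried as an absolute name first,
      # skipping the cluster search-domain lookups.
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]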

The solution lies in adopting NodeLocal DNSCache. With NodeLocal DNSCache, each node runs its own cache, limiting requests to only those addresses not already cached. By implementing the node-local cache, we managed to decrease our DNS network traffic by 90%. Note that it is advisable not to host CoreDNS pods on workload nodes (such as those running Spark workloads), but on a dedicated core node group instead.
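
A minimal sketch of pinning CoreDNS to such a dedicated core node group; the label and taint names here are hypothetical, not Mobileye's actual values:

# Fragment of the CoreDNS Deployment pod template
spec:
  template:
    spec:
      # Schedule only onto the dedicated core node group
      nodeSelector:
        node-group: core           # hypothetical node label
      tolerations:
        - key: dedicated           # hypothetical taint on core nodes
          operator: Equal
          value: core
          effect: NoSchedule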

Successfully managing Amazon EKS control plane scaling

One of the standout benefits of opting for Amazon EKS is its capability to automatically scale the control plane based on specific metrics, alleviating the operational overhead of managing a Kubernetes control plane. Initially, as EKS clusters began to scale, the process was gradual. However, as Mobileye’s infrastructure expanded to encompass 200 nodes, we encountered significant challenges: API calls began timing out, webhooks failed, and pods experienced random crashes.

This led us to seek assistance from AWS Support, who advised us to enable Amazon EKS control plane logging, a step that revealed critical insights. Through the logs, we found KubernetesClientException and timeout errors, which indicated that our applications were scaling faster than the control plane could handle at that point.
The EKS control plane's design for auto-scalability is robust, yet Mobileye's scale tests presented an unprecedented scenario: attempting to scale from zero to thousands of nodes within minutes.

Mobileye worked closely with the Amazon EKS service team to make sure that our clusters could scale from zero to thousands of nodes within minutes.

Navigating the challenges of Spark local storage optimization

For shuffle purposes and local storage, we needed local disks. At first, our idea was to have an Amazon Elastic Block Store (Amazon EBS) volume per pod. That way, each Spark app would request storage appropriate to its needs, and the EBS volume would be destroyed when the app finished, making the design very cost-efficient.

To do that, we adopted the Amazon EBS CSI driver for dynamic volume creation, but we ran into issues with scaling and Amazon Elastic Compute Cloud (Amazon EC2) API limits. Using the following parameters helped somewhat:

csi-attacher:
- '--kube-api-burst=20'
- '--kube-api-qps=20.0'
- '--worker-threads=500'
- '--reconcile-sync=10m'

csi-provisioner:
- '--kube-api-burst=20'
- '--kube-api-qps=20.0'
- '--worker-threads=300'

We still encountered hurdles upon scaling to full capacity. Upon conveying our experience and requirements to the Amazon EBS service team, they collaborated closely with us to address these challenges, releasing performance enhancements to the EBS CSI driver, which we are in the process of integrating into our clusters.

In the meantime, we devised a solution with the DoEKS team, employing Karpenter Userdata to dynamically create an EBS volume for each node. This approach tailors the EBS volumes to the number of cores available on each node, making sure of cost efficiency by precisely matching the volume size to the node’s requirements. This prevents the over-provisioning of EBS volumes for smaller instances, aligning resources more closely with actual needs. This approach also minimizes the number of EC2 API calls, further optimizing system performance and efficiency.
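
A hedged sketch of that approach, as a Karpenter AWSNodeTemplate whose user data creates, attaches, and mounts a gp3 volume sized by core count. The sizing ratio, names, tags, and device handling are illustrative assumptions, not Mobileye's actual script; the node's IAM role also needs EC2 volume permissions, and depending on the AMI family the script may need to be wrapped in a MIME multi-part archive:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: spark-node-template            # hypothetical name
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster       # hypothetical discovery tag
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  userData: |
    #!/bin/bash
    set -euo pipefail
    # Illustrative ratio: 20 GiB of shuffle space per vCPU.
    SIZE=$(( $(nproc) * 20 ))
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/placement/availability-zone)
    # One create/attach call pair per node, instead of one volume
    # per pod through the CSI driver, reducing EC2 API pressure.
    VOLUME_ID=$(aws ec2 create-volume --availability-zone "$AZ" \
      --size "$SIZE" --volume-type gp3 \
      --query VolumeId --output text)
    aws ec2 wait volume-available --volume-ids "$VOLUME_ID"
    aws ec2 attach-volume --volume-id "$VOLUME_ID" \
      --instance-id "$INSTANCE_ID" --device /dev/xvdb
    aws ec2 wait volume-in-use --volume-ids "$VOLUME_ID"
    # Delete the volume when the instance terminates.
    aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" \
      --block-device-mappings \
      '[{"DeviceName":"/dev/xvdb","Ebs":{"DeleteOnTermination":true}}]'
    # On Nitro instances the volume enumerates as an NVMe device.
    DEV=/dev/xvdb; [ -e /dev/nvme1n1 ] && DEV=/dev/nvme1n1
    mkfs -t xfs "$DEV"
    mkdir -p /local1 && mount "$DEV" /local1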

Using Karpenter autoscaling for Spark on Kubernetes

With the Spark on Kubernetes setup, initiating a spark-submit operation triggers a resource request to the control plane for pod creation. This process involves the Spark driver creating a Spark context and requesting resources from the control plane. However, actual instance creation requires either an existing instance or an autoscaling solution. Given Mobileye's dynamic scaling needs, a fast and adaptable autoscaling solution was essential for cost efficiency.

After reviewing the available options for Amazon EKS, advice from Mobileye’s R&D teams, and AWS consultants, we unanimously favored Karpenter for several compelling reasons:

  • Versatility in configuration: Karpenter allows you to configure multiple node pools with different settings and configurations, enabling effective management of your Kubernetes cluster’s compute resources. This capability offers workload segregation by allowing you to configure specific hardware configurations (CPU, memory, GPU) for different node pools, tailored to each Spark application.
  • Scalability and reliability: It is designed for high scalability, availability, and maintenance, with official support from AWS.
  • Flexible instance options: Karpenter allows a mix of Amazon EC2 Spot Instances (Spot), On-Demand Instances, and Reserved Instances, allowing us to build cost-optimized Spark applications. It also integrates seamlessly with AWS features such as AWS Identity and Access Management (IAM) policies and Amazon EBS, which helps us create a holistic configuration.

For Spark specifically, Mobileye had precise requirements:

  1. Drivers should run on On-Demand or Reserved Instances.
  2. 100% of workers run on Spot instances.
  3. From 0 to 100% usage (where 100% means 100,000 vCPUs or more) instantly.
  4. Effective scale-down capabilities.
  5. Efficient management of spare capacity.
  6. The ability to add or modify configurations, such as new instance types and Amazon Machine Images (AMIs), without disrupting service.
  7. Running with multiple configurations according to the Spark app requirements.
  8. Separation between the Spark workload and other system services to avoid conflicts and service interruptions.

To meet these specifications, Mobileye opted for multiple Karpenter Provisioners (referred to as node pools starting with Karpenter version 0.32), establishing dedicated provisioners for the following:

  1. Spark Drivers
  2. Spark Workers
  3. Other infrastructure needs such as core node groups

The Spark app is directed to its provisioners using affinity, taints, and tolerations. When the configuration needs to change, a new configuration can either be added as an additional provisioner or edited into the current provisioners.

From the tests that Mobileye performed, it is clear that Karpenter can handle multiple provisioners without issues; we have tested more than 15 provisioners. The only caveat is that provisioners must be configured correctly, because a misconfigured provisioner can, in some cases, cause functionality problems. Karpenter performed very well in terms of both stability and latency, even at large scale and for use cases that required spikes from 0 to 100K vCPUs at once.

As for efficient scale-down and spare capacity management, Mobileye found it easy to implement by using ttlSecondsAfterEmpty: when an instance has no pods running on it, Karpenter waits ttlSecondsAfterEmpty before scaling it down. This allows scale-down only when needed and reserves spare capacity so new Spark apps can start immediately in a busy environment (see the sketch below).
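
A hedged sketch of the worker provisioner described above (Karpenter v1alpha5 API, matching the pre-0.32 Provisioner naming; the labels, taint, limit, and TTL are illustrative, not Mobileye's production values):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spark-worker               # hypothetical name
spec:
  requirements:
    # 100% of workers on Spot; a driver provisioner would mirror
    # this definition with values: ["on-demand"].
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  labels:
    nodeType: executor
  taints:
    # Only Spark executor pods tolerating this taint land here,
    # keeping system services off the Spark nodes.
    - key: nodeType
      value: executor
      effect: NoSchedule
  limits:
    resources:
      cpu: "100000"                # illustrative ceiling
  # Keep empty nodes around for 5 minutes as spare capacity
  # before Karpenter scales them down.
  ttlSecondsAfterEmpty: 300
  providerRef:
    name: spark-node-template      # hypothetical AWSNodeTemplate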

By default, Kubernetes uses empty nodes as a first option when it has both empty and occupied nodes. To be more efficient, Mobileye used this podAffinityTerm configuration to place new pods on occupied nodes (and allow empty nodes to be scaled down accordingly):

podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: nodeType
                operator: In
                values:
                  - executor
          topologyKey: kubernetes.io/hostname

Optimize inter-AZ network traffic

By default, Kubernetes on AWS tries to spread your workload across nodes in multiple AZs. Although this deployment pattern aligns with Spot diversification requirements, it poses a potential problem. First, cross-AZ latency is typically in the single-digit milliseconds; compared to nodes within a single AZ (with microsecond latency), this degrades shuffle service performance. Choose nodes in a single AZ to improve Spark shuffle performance, since shuffle involves disk, data serialization, and network I/O. Second, cross-AZ communication between EC2 instances incurs a cost of $0.01/GB.

In our topology, we chose to prefer a single AZ for the Spark workers of the same Spark application by using pod affinity. This still lets us use multiple AZs for Spot availability and cost reduction, because most of the traffic is shuffle data exchanged between Spark workers of the same application. Therefore, we added the following term to the podAffinity shown earlier:

      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: sparkoperator.k8s.io/app-name
                operator: In
                values:
                  - APP_NAME_PLACEHOLDER
          topologyKey: topology.kubernetes.io/zone

Observability and logging in Spark

One of the strongest capabilities of running Spark on Kubernetes is the ability to observe and monitor resources for your Spark applications. For example, if a specific Spark application has a storage or memory issue, in Spark standalone you would have to find which node is running the problematic application and dig for the relevant logs, which is time-consuming and often nearly impossible. In Kubernetes, each Spark application can get the app name as a prefix (a default Spark Operator configuration), so you can view the driver and executor logs in isolation. You can also add metrics, view them in Prometheus or any other metrics solution, and analyze the logs with Loki or an equivalent solution.
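
For example, the SparkApplication CRD can expose driver and executor metrics to Prometheus through the operator's built-in JMX exporter support. A minimal sketch (the jar path and port follow the operator's documented defaults; verify them against your operator version):

# Fragment of a SparkApplication spec
spec:
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      # Exporter agent shipped in the operator's Spark images
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
      port: 8090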

Using Kubernetes, you can also integrate with third-party monitoring solutions. We chose to send some metrics directly to a remote storage backend, and we created Amazon CloudWatch dashboards from our metrics, presenting things such as the number of apps running, driver cores, worker cores, issues discovered, and more.

Here is a snapshot from a regular production day:

Figure: On this typical production day, you can observe the quantity of applications deployed across multiple EKS clusters, distributed across two AWS Regions, along with the number of instances, executor cores, and Spot Instance terminations occurring within each Region.

Spark operator modifications

During load tests, we found that the Spark Operator slows down when the scale approaches more than 1,000 apps overall, or more than 200 apps submitted at once. To address this, we modified the operator, and we are working to contribute the changes back to the upstream Spark Operator repository. Now that Kubeflow manages the Spark Operator project and it has become much more active, we hope our changes will be accepted upstream soon.

Using multiple EKS clusters for enhanced performance

Our journey revealed that managing more than 1500 Spark applications within a single EKS cluster could compromise performance, despite fine-tuning the Spark Operator and equipping it with powerful instances. To address this challenge, we embraced a horizontal scaling strategy by distributing workloads across several EKS clusters. This approach was facilitated by the use of DoEKS Blueprints, enabling us to deploy and administer multiple clusters efficiently while maintaining consistency in configurations. This strategic deployment not only improved system performance but also streamlined cluster management, showcasing the scalability and flexibility of Amazon EKS in handling extensive application loads.

Conclusion

Mobileye's adoption of Amazon Elastic Kubernetes Service (Amazon EKS) and Karpenter, coupled with the strategic support from AWS Support and the Data on EKS (DoEKS) team, has been instrumental in revolutionizing the creation of high-definition (HD) maps essential for autonomous vehicles. This concerted effort has allowed Mobileye to surmount the complexities of scaling, optimizing, and managing data workloads effectively. Using Amazon EKS, Mobileye scaled to run more than 400,000 virtual CPUs concurrently on thousands of Amazon Elastic Compute Cloud (Amazon EC2) instances while reducing developer overhead by 50% by using AWS managed services.

The use of Amazon EKS has provided a robust, scalable infrastructure that simplifies operations, while Karpenter has enabled dynamic, efficient autoscaling, making sure that resources meet the demands of intensive data processing tasks seamlessly. Together, these technologies have propelled Mobileye to the forefront of innovation in autonomous driving, showcasing the transformative potential of Amazon EKS and Karpenter in driving technological advancements within the automotive industry.

Also, be sure to check out the recent AWS Podcast discussing our experiences in building the Mobileye platform on Amazon EKS, where we dive deeper into the challenges we faced and the innovative solutions we employed to overcome them.

Bentzi Lifshitz

Bentzi is a software developer at Mobileye REM™, where he serves as the tech lead for Spartacus, Mobileye's internal Apache Spark orchestration platform. He loves planning, designing, and executing complex systems at massive scale, as well as writing code, with deep knowledge of infrastructure and the various tools that support it.

Daniel Goldenstein-Bar

Daniel is a Sr. Solutions Architect at AWS Israel. He is part of the Enterprise segment and works with automotive customers to build solutions and assist them in the journey toward their goals.

Vara Bonthu

Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, Data Analytics, AI/ML, and Kubernetes, and boasts an extensive background in development, DevOps, and architecture. Vara's primary focus is on building highly scalable Data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.