Posted On: May 10, 2021
Amazon SageMaker now supports Elastic Fabric Adapter (EFA) for training machine learning models. EFA is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communication at scale on AWS. EFA can significantly speed up distributed training on SageMaker at no additional cost. For example, we trained the BERT natural language processing model with SageMaker’s distributed data parallel library on 32 ml.p4d.24xlarge instances. Training was up to 130% faster with EFA than with Elastic Network Adapter (ENA).
Distributed training enables developers and data scientists to train models faster and improve model quality. Customers use the SageMaker distributed training libraries because they offer fast and easy methods for training large deep learning models on large datasets. EFA’s operating system bypass mechanism lets applications communicate with the network interface directly, which improves the performance of inter-instance communication and leads to even faster distributed training on SageMaker.
There is no additional cost for using EFA on SageMaker. EFA in SageMaker is currently supported on ml.p3dn.24xlarge, ml.p4d.24xlarge, and ml.c5n.18xlarge instances. SageMaker distributed training jobs that use the TensorFlow and PyTorch Deep Learning Containers automatically take advantage of EFA without any action from customers. For training jobs that run in a VPC or use a custom Docker image, EFA can be enabled with minimal configuration changes.
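As a rough illustration of the instance-type constraint above, the following Python sketch checks whether a planned training job could benefit from EFA. The helper names supports_efa and efa_applies are hypothetical and are not part of any SageMaker API. The instance list contains only the types named in this post and may be out of date, and the multi-instance condition is an assumption based on EFA's role in inter-node communication.

```python
# Instance types named in this announcement as supporting EFA on SageMaker.
# Assumption: the list may have changed since the announcement.
EFA_INSTANCE_TYPES = frozenset({
    "ml.p3dn.24xlarge",
    "ml.p4d.24xlarge",
    "ml.c5n.18xlarge",
})


def supports_efa(instance_type: str) -> bool:
    """Return True if the instance type is on the announced EFA list."""
    return instance_type.strip().lower() in EFA_INSTANCE_TYPES


def efa_applies(instance_type: str, instance_count: int) -> bool:
    """Return True if a job uses an EFA-capable type on more than one instance.

    Assumption: EFA accelerates inter-node traffic, so a single-instance
    job has no inter-node communication to speed up.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return instance_count > 1 and supports_efa(instance_type)


if __name__ == "__main__":
    # The BERT example in this post used 32 ml.p4d.24xlarge instances.
    print(efa_applies("ml.p4d.24xlarge", 32))
    # An instance type that is not on the announced list.
    print(efa_applies("ml.p3.16xlarge", 32))
```

A check like this could run before submitting a job to warn when the chosen instance type is not on the supported list.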
To learn more about EFA support in Amazon SageMaker, see the documentation for the SageMaker distributed training library or the guide on how to run training with EFA in your container. To get started, log in to the Amazon SageMaker console.