AWS Machine Learning Blog

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models onto accelerators. By streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader can achieve up to 15 times faster loading times compared to traditional methods.

As the AI landscape continues to evolve and models grow even larger, innovations like Fast Model Loader become increasingly crucial. By significantly reducing model loading times, this feature has the potential to transform the way you deploy and scale your LLMs, enabling more responsive and efficient AI applications across a wide range of use cases.

In this post, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments. We explore two approaches: using the SageMaker Python SDK for programmatic implementation, and using the Amazon SageMaker Studio UI for a more visual, interactive experience. Whether you’re a developer who prefers working with code or someone who favors a graphical interface, you’ll learn how to take advantage of this powerful feature to accelerate your LLM deployments.

Solution overview

Fast Model Loader is currently integrated with SageMaker Large Model Inference (LMI) containers (starting with v13) for GPU instances. It introduces two key techniques to enable lightning-fast model loads:

  • Weight streaming – Streams model weights directly from Amazon S3 into accelerator memory, instead of downloading and loading the full model in separate steps
  • Model sharding for streaming – Prepares the model weights as pre-sharded artifacts that can be streamed at load time, with the shard layout determined by the tensor parallel degree

Use Fast Model Loader with the SageMaker Python SDK

In this section, we show how to use this new feature with the SageMaker Python SDK. You can find the example notebook in the following GitHub repo. Complete the following steps:

  1. First, use ModelBuilder to prepare and package the model inference components.

To learn more about the ModelBuilder class, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. In this example, you deploy the Meta Llama 3.1 70B model with the model name meta-textgeneration-llama-3-1-70b in Amazon SageMaker JumpStart.

The SchemaBuilder parameter is used to infer the serialization and deserialization methods for the model. For more information on SchemaBuilder, refer to Define serialization and deserialization methods.

You can choose to specify OPTION_TENSOR_PARALLEL_DEGREE as a ModelBuilder environment variable, as shown in the following commented lines, or in the next step as part of the sharding_config passed to optimize():

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

# Define sample input and output for the model
prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."
# Create the input schema structure
sample_input = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 32}
}
# Define the expected output format
sample_output = [{"generated_text": response}]

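# role (your SageMaker execution role ARN) and sess (a sagemaker.Session object)
# are assumed to be defined earlier in the notebook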
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
    #env_vars={
    #   "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
)
  2. Next, use the optimize() function to prepare the model shards for deployment.

The optimize() function starts a model optimization job, which takes a few minutes to complete. Set the tensor parallel degree to the number of GPUs that each inference component will have access to. When the job finishes, you can find the model shards at the output_path S3 location under a folder whose name starts with sagemaker-fast-model-loader-xxx (one way to verify this is shown after the following code):

model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            # The value must match the number of GPUs that each inference component (IC) will use.
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)
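
To confirm where the optimization job wrote the shards, the following is a minimal sketch that lists the objects under output_path with boto3. It assumes output_path is an s3://bucket/prefix URI, as used above, and is not part of the SageMaker SDK flow itself.

import boto3
from urllib.parse import urlparse

# Parse the s3://bucket/prefix URI used as output_path above
parsed = urlparse(output_path)
bucket, prefix = parsed.netloc, parsed.path.lstrip("/")

# List the optimization job output; the shards sit under a folder
# whose name starts with sagemaker-fast-model-loader-
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])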

You can also reuse a sharded model generated by a previous optimization job. The following code sample demonstrates how to use model_metadata to override the model path, which must point to the Amazon S3 location of the existing model shards:

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    model_metadata={
        "CUSTOM_MODEL_PATH": output_path,
    },
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
    role_arn=role,
    instance_type="ml.p4d.24xlarge",
)
  3. When the model optimization job is complete, you can use the build() function to generate the artifacts according to the model server:
    # use the build() function to generate the artifacts according to the model server
    final_model = model_builder.build()
  4. If you’re using existing model shards without running an optimization job, make sure the _is_sharded_model value is set to True and _enable_network_isolation is set to False, because Fast Model Loader requires network access:
    # You only need to set these values if you are using existing sharded models
    if not final_model._is_sharded_model:
        final_model._is_sharded_model = True
    if final_model._enable_network_isolation:
        final_model._enable_network_isolation = False
  5. Use the deploy() function to deploy the model to an endpoint, where you can specify the required resources, such as GPU memory and number of accelerators:
    from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
    
    resources_required = ResourceRequirements(
        requests={
            "memory": 204800,  # memory to reserve for the inference component, in MB
            "num_accelerators": 8  # number of GPUs for the inference component
        }
    )
    
    # deploy the optimized model to an endpoint
    final_model.deploy(
        instance_type="ml.p4d.24xlarge", 
        accept_eula=True, 
        endpoint_logging=False, 
        resources=resources_required
    )
  6. After the endpoint is up and running, you can test it using the following code example:
    from sagemaker.predictor import retrieve_default 
    endpoint_name = final_model.endpoint_name 
    predictor = retrieve_default(endpoint_name) 
    payload = { "inputs": "I believe the meaning of life is", 
                "parameters": { 
                    "max_new_tokens": 64, 
                    "top_p": 0.9, 
                    "temperature": 0.6 
                } 
            }
    response = predictor.predict(payload) 
    print(response)
  7. To clean up, run the following code cell to delete the resources created for the endpoint:
    predictor.delete_predictor()
    predictor.delete_endpoint()

Use Fast Model Loader with SageMaker Studio

In this section, we show how to use Fast Model Loader through the SageMaker Studio UI. Complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Choose your model.
  3. On the model details page, choose Optimize.
  4. Accept the EULA and proceed to the optimization configurations.
  5. Select Fast model loading and set OPTION_TENSOR_PARALLEL_DEGREE to 8, because this example uses an ml.p4d.24xlarge instance that has 8 GPUs. If you’re using an instance with a different number of GPUs, set the value to match the number of GPUs on the instance.
  6. Set the output path to the Amazon S3 path where the sharded model will be stored.
  7. Choose Create job.

After the inference optimization job starts, you can check its status on the Inference optimization page. Each job is tagged with the optimization configuration that was used.
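
If you prefer to check job status programmatically rather than in the UI, the following minimal sketch uses boto3. It assumes your boto3 version includes the SageMaker optimization job APIs (such as list_optimization_jobs); field names are read defensively with .get() in case they differ in your SDK version.

import boto3

# Assumes the SageMaker optimization job APIs are available in your boto3 version
sm = boto3.client("sagemaker")
response = sm.list_optimization_jobs(MaxResults=10)
for job in response.get("OptimizationJobSummaries", []):
    print(job.get("OptimizationJobName"), job.get("OptimizationJobStatus"))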

  8. View the details of the job by choosing the job ID.
  9. Deploy the optimized model by choosing Deploy on the optimization job details page.
  10. Verify the endpoint settings and choose Deploy to initiate a SageMaker endpoint deployment.

You will get a notification on the SageMaker Studio UI, and the status will change to In service when the endpoint creation is complete.

You can now send a sample inference request to test the model.
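
For example, the following is a minimal sketch that sends a sample request through the SageMaker Runtime API with boto3. The endpoint name is a placeholder for the name you chose during deployment; if the model was deployed as an inference component, also pass InferenceComponentName to invoke_endpoint.

import json
import boto3

# Placeholder: replace with the endpoint name from your Studio deployment
endpoint_name = "<your-endpoint-name>"

runtime = boto3.client("sagemaker-runtime")
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
    # If the model was deployed as an inference component, also set:
    # InferenceComponentName="<your-inference-component-name>",
)
print(response["Body"].read().decode("utf-8"))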

After the test, you can delete the endpoint from the SageMaker Studio console to clean up the resources created in this example.

Conclusion

Fast Model Loader represents a significant advancement in how you can deploy and scale LLMs on SageMaker. In this post, we walked through the step-by-step process of implementing this feature through both the SageMaker Python SDK and SageMaker Studio UI. By using weight streaming and model sharding techniques, you can now achieve dramatically faster model loading times, enabling more responsive scaling for your LLM-based applications.

The integration with SageMaker LMI containers (starting from LMI v13) makes it straightforward to adopt this feature in your existing workflows. Whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM services, Fast Model Loader provides the tools you need to optimize your model deployment pipeline.

Try out Fast Model Loader for your own use case, and leave your feedback and questions in the comments.


About the Authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.