AWS for Industries

Simplifying Medical Imaging AI Deployments with NVIDIA NIMs and AWS Services

Most practicing clinicians are not yet fully benefiting from the efficiency and diagnostic advances that medical imaging artificial intelligence (AI) promises. Additionally, many AI scientists and engineers struggle with the practical aspects of incorporating AI inference into clinical workflows and providing a consistent end-user experience when scaling to support millions of studies per year.

Still, the clinical practices of radiology and digital pathology are being transformed by AI. To date, the US Food and Drug Administration (FDA) has approved 950 AI-enabled medical devices, and 77% of those are in the radiology and pathology domains. The potential of AI is rapidly expanding as imaging foundation models unlock capabilities beyond what was possible with traditional computer vision approaches.

We will demonstrate how to streamline medical imaging AI deployments with NVIDIA NIM inference microservices and managed Amazon Web Services (AWS) services, including Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), and AWS HealthImaging.

NIM is an important new paradigm: easy-to-use microservices designed to accelerate the deployment of generative AI models across all industries. This includes medical imaging AI, where NIM microservices (such as the one packaging VISTA-3D, a foundation model from NVIDIA) are transforming the industry with easy-to-deploy containers that accelerate last-mile delivery of medical imaging AI applications.

Amazon SageMaker is a machine learning (ML) service offering managed data processing, model training (including foundation models at scale), hyperparameter tuning, model inference, and full MLOps capabilities.

Amazon EKS is a fully managed Kubernetes service running in the cloud and on premises, with integrated tooling through open-source standards.

The VISTA-3D NIM container has also been customized with a connector to AWS HealthImaging, a HIPAA-eligible, highly scalable, and cost-effective cloud service for storing medical imaging data. The integration accelerates medical imaging AI applications with sub-second image retrieval latencies at scale, powered by cloud-native APIs.

Using this solution, we'll describe how AI developers can build scalable, streamlined medical imaging AI applications that help practicing clinicians speed up their clinical workflows and improve their productivity. For this solution, our use case is segmentation of organs in computed tomography (CT) images of the chest.

Solution overview

The NVIDIA VISTA-3D NIM packages an encoder-decoder foundation model, the Versatile Imaging SegmenTation and Annotation (VISTA-3D) model, which can be used for zero-shot (open-vocabulary) segmentation. VISTA-3D segments over 120 organs and structures in CT scans. It is easy to work with because it presents a model inference endpoint through industry-standard REST APIs. A frontend FastAPI process routes HTTP requests to a backend model inference process hosted on the open source NVIDIA Triton™ Inference Server, which deploys and optimizes scalable, low-latency AI model inference on GPUs.
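For example, here is a minimal Python sketch of a segmentation request against a locally running container (the host and port are assumptions for a local deployment; the payload shape follows the request example shown later in this post):

import requests

# Assumed local endpoint; the host and port depend on how the NIM container
# is run (the /vista3d/inference path matches the curl example later on)
NIM_URL = "http://localhost:8008/vista3d/inference"

payload = {
    "image": "https://assets.ngc.nvidia.com/products/api-catalog/vista3d/example-1.nii.gz",
    "prompts": {"classes": ["liver", "spleen"]},
}

# The response body is the segmentation volume, saved here as an NRRD file
response = requests.post(NIM_URL, json=payload, timeout=300)
response.raise_for_status()
with open("segmentation.nrrd", "wb") as f:
    f.write(response.content)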

Figure 1. The NVIDIA NIM container architecture includes the libraries and tooling for low-latency AI model inference: standard APIs, NVIDIA Triton Inference Server, NVIDIA TensorRT, enterprise management, multi-GPU networking, and a customization cache

The following architecture in Figure 2 demonstrates how to deploy the VISTA-3D NIM on Amazon SageMaker, integrating with data stored on HealthImaging at scale.

Figure 2. Architecture for running the NIM on Amazon SageMaker, integrating with data from AWS HealthImaging. Users upload data from a local imaging PACS/VNA server to Amazon S3 and index it in AWS HealthImaging; the NIM container is deployed on the managed service Amazon SageMaker, and users interact with the SageMaker inference endpoint from a Jupyter notebook

The medical images in DICOM format will be staged in Amazon Simple Storage Service (Amazon S3) and imported into HealthImaging. The VISTA-3D NIM container will be downloaded from NVIDIA NGC™, NVIDIA's repository of containers for AI/ML, metaverse, and HPC applications, and then pushed to a private repository in Amazon Elastic Container Registry (Amazon ECR), which is required for both the SageMaker and Amazon EKS deployments.

SageMaker inference endpoints have built-in high availability, which means the NIM container will be deployed across multiple Availability Zones. You can also choose which type of hosting endpoint to use on SageMaker, such as near real-time inference or asynchronous inference. You can also run a Jupyter notebook in SageMaker Studio, which makes it straightforward to deploy and manage the inference endpoint through the AWS SDK for Python (Boto3).

It is also possible to deploy NIM containers on AWS with Amazon EKS using the architecture shown in Figure 3.

Figure 3. Architecture for running the NIM container on Amazon EKS. The container is deployed to a managed EKS cluster, with permission control, encryption, and observability provided by managed AWS services

In this architecture, you can deploy the container in private subnets and use AWS PrivateLink to keep network traffic private. You can also use AWS Identity and Access Management (IAM) for role-based access and permission control. The AWS Load Balancer Controller (installed with Helm) and the Amazon CloudWatch Observability agent are packaged into this automated NIM-on-EKS deployment and are installed at the same time.

Prerequisites

Visit the NVIDIA API Catalog VISTA-3D model page and choose the “Build with this NIM” button. If you are not logged in, you will be prompted to enter an email address. You can use a business email address, which provides a 90-day NVIDIA AI Enterprise license, or a personal email address, which lets you join through the NVIDIA Developer Program.

Once logged in, choose the same “Build with this NIM” button and then “Generate API Key” to get the API key needed to download the NIM container; alternatively, under any of the code snippet tabs (Shell, Python, or Node), select “Get API Key”. With the API key in hand, follow the VISTA-3D documentation for detailed instructions on pulling and running the VISTA-3D model container.

To deploy the VISTA-3D container on AWS, first create an AWS account. For the Amazon SageMaker deployment, set up a SageMaker notebook instance and download the sample code from this GitHub repo. For the Amazon EKS deployment, first create an EKS cluster using the Data-on-EKS automation script. You can then use AWS CloudShell or a local terminal with the command line tools (for example, kubectl and Helm) to deploy the container to the EKS cluster.

Deployment Walkthrough

Our first step is to build a custom container from the NIM base image provided by NVIDIA. You can build the image using the Dockerfile from the GitHub repo listed in the prerequisites section; a Linux x86 environment is required to build the image. After that, create a private repository in Amazon ECR and push the container image to it. The customized container has a connector layer that can take medical images as input from either HealthImaging or Amazon S3.

When using Amazon S3 as the image source, the custom container layer downloads the NIfTI or DICOM files from Amazon S3 using the Boto3 Python library. HealthImaging supports only DICOM images, which are stored in datastores, so you need to provide a DatastoreId and an ImageSetId (analogous to the bucket name and object key in Amazon S3).
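As a rough sketch of what the S3 path of the connector layer does (the bucket and key here are illustrative; in the actual container they are parsed from the s3:// input URI):

import boto3
import SimpleITK as sitk

s3 = boto3.client("s3")

# Illustrative bucket/key; in practice these come from the s3:// input URI
s3.download_file("my-imaging-bucket", "example-1.nii.gz", "/tmp/example-1.nii.gz")

# Load the NIfTI volume for inference
image = sitk.ReadImage("/tmp/example-1.nii.gz")
print(image.GetSize(), image.GetSpacing())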

When using HealthImaging, a single DICOM instance can be retrieved with the GetDICOMInstance API action and converted to NIfTI format using SimpleITK. For multi-frame images, the container downloads all of the pixel frames for a given image set from HealthImaging, decodes them with nvJPEG2000 on a GPU, and converts the resulting NumPy arrays into NIfTI files using CuPy and SimpleITK.
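As a minimal Boto3 sketch of this HealthImaging retrieval path (the IDs are placeholders, and the GPU decode step is summarized in comments rather than implemented):

import gzip
import json
import boto3

ahi = boto3.client("medical-imaging")  # AWS HealthImaging

datastore_id = "<datastoreId>"
image_set_id = "<imagesetId>"

# Image set metadata is a gzipped JSON document that lists the image frame IDs
meta_resp = ahi.get_image_set_metadata(
    datastoreId=datastore_id, imageSetId=image_set_id
)
metadata = json.loads(gzip.decompress(meta_resp["imageSetMetadataBlob"].read()))

# Retrieve one HTJ2K-encoded pixel frame; frame IDs come from the metadata.
# The connector decodes frames on the GPU with nvJPEG2000; a CPU-only sketch
# could substitute any HTJ2K-capable codec such as OpenJPEG.
frame_resp = ahi.get_image_frame(
    datastoreId=datastore_id,
    imageSetId=image_set_id,
    imageFrameInformation={"imageFrameId": "<imageFrameId>"},
)
htj2k_bytes = frame_resp["imageFrameBlob"].read()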

You can post the requests to the NIM endpoints using the following example URIs:

  • For Amazon S3: s3://<s3bucket>/example-1.nii.gz
  • For HealthImaging DICOMweb API: https://dicom-medical-imaging.us-east-1.amazonaws.com/datastore/<datastoreId>/studies/<StudyUID>/series/<SeriesUID>/instances/<InstanceUID>?imageSetId=<imagesetId>
  • For HealthImaging GetImageFrame API: healthimaging://<datastoreId>/<imagesetId>

Once you have this customized container in Amazon ECR, you can deploy it on either SageMaker or Amazon EKS. To work with a SageMaker managed inference endpoint, the customized container listens on port 8080 and accepts POST requests on the /invocations path. SageMaker inference endpoints are managed with automatic health checks, load balancing, and autoscaling. With the pre-built Helm chart, you can also deploy this customized NIM container on Amazon EKS and monitor the deployment using Amazon CloudWatch.

1. Amazon SageMaker Deployment Walkthrough

Using Amazon SageMaker, you can deploy different types of highly available and monitored inference endpoints: near real-time endpoints, asynchronous endpoints for micro-batch inference, and large batch transform jobs. You can use the Boto3 Python SDK in a Jupyter notebook to create and manage these endpoints.
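The snippets below reference a model_name, which must first be registered as a SageMaker model pointing at the customized container image. A minimal sketch, assuming the image has been pushed to your private Amazon ECR (the image URI and execution role ARN are placeholders):

import boto3

sm_client = boto3.client("sagemaker")
model_name = "vista3d-nim"  # illustrative name, referenced by the snippets below

create_model_response = sm_client.create_model(
    ModelName=model_name,
    PrimaryContainer={
        # Placeholder URI of the customized NIM image in your private Amazon ECR
        "Image": "<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>",
    },
    # IAM role that allows SageMaker to pull the image and emit logs and metrics
    ExecutionRoleArn="<sagemaker-execution-role-arn>",
)

With the model registered, create the near real-time endpoint configuration and the endpoint itself: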

endpoint_config_name = model_name + '-config'
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name, 
    ProductionVariants = [
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name, 
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
            "ContainerStartupHealthCheckTimeoutInSeconds": SG_CONTAINER_STARTUP_TIMEOUT
        }
    ]
)

endpoint_name = model_name + '-endpoint'
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
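After the endpoint reaches the InService status, you can send a segmentation request through the SageMaker runtime. A minimal sketch, assuming the customized container accepts the same JSON payload shown elsewhere in this post:

import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

payload = {
    "image": "s3://<s3bucket>/example-1.nii.gz",  # placeholder input URI
    "prompts": {"classes": ["liver", "spleen"]},
}

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
# The body carries the segmentation result returned by the container
result_bytes = response["Body"].read()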

Alternatively, you can add AsyncInferenceConfig to the endpoint configuration to create an asynchronous inference endpoint:

endpoint_config_name = model_name + '-async-config'
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name, 
    ProductionVariants = [
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name, 
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
            "ContainerStartupHealthCheckTimeoutInSeconds": SG_CONTAINER_STARTUP_TIMEOUT
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/nim/vista3d/output"
        }
    }
)

endpoint_name = model_name + '-async-endpoint'
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
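Asynchronous endpoints are invoked with a request payload staged in Amazon S3 rather than an inline body. A minimal sketch (the input key is illustrative):

import boto3

smr_client = boto3.client("sagemaker-runtime")

# The JSON request body is staged in S3 first (for example, with s3.put_object);
# InputLocation points at that object rather than carrying the payload inline.
async_response = smr_client.invoke_endpoint_async(
    EndpointName=endpoint_name,
    ContentType="application/json",
    InputLocation=f"s3://{bucket}/nim/vista3d/input/request.json",  # illustrative key
)

# SageMaker queues the request and writes the result under S3OutputPath;
# poll this location (or subscribe to a notification) for completion.
print(async_response["OutputLocation"])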

If you select an asynchronous endpoint, you can scale the compute capacity down to zero instances during times of low usage. This helps you avoid paying for idle instances and reduces your costs automatically. You can do this by defining an autoscaling policy that permits scaling the SageMaker inference endpoint to zero instances, as follows:

autoscale_client = boto3.client('application-autoscaling')

# This is the format in which application autoscaling references the endpoint
resource_id=f"endpoint/{endpoint_name}/variant/AllTraffic"

# Define and register your endpoint variant
response = autoscale_client.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=0,
    MaxCapacity=1
)

You will also need to add a policy to start a new instance for the inference endpoint if there are new requests in the queue:

response = autoscale_client.put_scaling_policy(
    PolicyName="ApproximateBacklogSizePerInstance-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1.0, # The target value for the metric. Here the metric is: ApproximateBacklogSizePerInstance
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name }
            ],
            'Statistic': 'Average',
        }
    }        
)

2. Amazon EKS Deployment Walkthrough

We will next walk through deploying the VISTA-3D NIM using Amazon EKS. First, clone the Data-on-EKS repo and go to the ai/ml folder containing the installation script. Before deployment, use your preferred code editor to change the instance size in the eks.tf file for hosting NIM containers to a smaller instance, like a g5.xlarge. Also, change the Amazon EKS cluster name to vistanim-on-eks in the variables.tf file, and change the AWS Region in the same file to where you want to host the inference endpoint, for example, us-east-1.

After you have made these changes, you can deploy the stack by running ./install.sh. When that finishes successfully, configure kubectl:

aws eks update-kubeconfig --name vistanim-on-eks --region <region>

Now you can use the Helm chart created by NVIDIA to deploy the NIM container on this Amazon EKS cluster. Clone the NVIDIA nim-deploy repo and move the helm folder into the current folder. Edit the VISTA-3D NIM configuration file to replace the container image repository and tag with those of your private Amazon ECR image. Then deploy the VISTA-3D NIM using the nim-deploy Helm chart:

export NGC_API_KEY=<NGC_API_KEY_HERE>
helm --namespace vista install vista helm/nim-llm/ --create-namespace --set model.ngcAPIKey="$NGC_API_KEY" -f vista3d-values.yaml

Check that your pods are up and healthy (pods should be in a Running state with 1/1 Ready status):

kubectl get pods -n vista -o wide

You should see pods in a running state, like in Figure 4.

Figure 4. Screenshot of the terminal running kubectl command line showing running pods inside an EKS cluster

Now you can set up an Application Load Balancer ingress to allow traffic to the NIM inference endpoints. Deploy the ingress from the ingress.yaml configuration file to expose the VISTA-3D NIM:

kubectl apply -f eks/ingress.yaml

You can check the public address of your Application Load Balancer generated by the AWS Load Balancer Controller:

kubectl get ing -n vista -o wide

With the public Application Load Balancer domain address, you can now make requests to your VISTA-3D inference endpoint:

curl -X POST http://{ALB_ADDRESS_HERE}/vista3d/inference \
-H "Content-Type: application/json" \
--output output.nrrd \
-d '{
    "image": "https://assets.ngc.nvidia.com/products/api-catalog/vista3d/example-1.nii.gz",
    "prompts": {
        "classes": ["lung tumor", "hepatic tumor", "liver", "spleen",
                    "left lung upper lobe", "left lung lower lobe",
                    "right lung upper lobe", "right lung middle lobe",
                    "right lung lower lobe"]
    }
}'

Using this automated Amazon EKS deployment, you get Container Insights for observability out of the box, as shown in Figure 5.

Figure 5. Screenshot of Amazon CloudWatch Container Insights; GPU utilization is available in the dashboard
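If you prefer to consume these metrics programmatically, you can query CloudWatch directly. A hedged sketch (the Container Insights namespace is standard, but the exact GPU metric name and dimensions can vary with the agent version, so treat them as assumptions):

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

# Assumed metric name/dimension for Container Insights NVIDIA GPU metrics
stats = cw.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="node_gpu_utilization",
    Dimensions=[{"Name": "ClusterName", "Value": "vistanim-on-eks"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])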

Once you are done with all experiments, run the following script to delete the Amazon EKS cluster:

./cleanup.sh

Conclusion

We walked through two ways to deploy the medical imaging NIM from NVIDIA using managed services like Amazon SageMaker, Amazon EKS, and AWS HealthImaging. By taking advantage of the automated deployment and the built-in features for high availability and observability, you can have a scalable, production-ready medical imaging AI system ready to integrate into your medical imaging workflows.

We’d like to acknowledge Ahmed Harouni, a Technical Marketing Engineer at NVIDIA specializing in deep learning for medical imaging, for his contribution to this blog’s content.

Contact an AWS Representative to learn how we can help accelerate your business.

Steve Fu

Steve Fu is a Principal Solution Architect at AWS. He holds a PhD in Pharmaceutical Science from the University of Mississippi and has 10+ years of technology and biomedical research experience. He is passionate about technology and the impact it can make on healthcare.

Andy Schuetz

Andy Schuetz, PhD, is a Principal Product Manager for AWS HealthImaging at Amazon Web Services. Andy focuses on delivering purpose-built cloud services that power clinical and research applications for healthcare and life science customers. Andy received his PhD in Mechanical Engineering from U.C. Berkeley.

Brad Genereaux

Brad Genereaux is Global Lead, Healthcare Alliances at NVIDIA, responsible for developer relations within medical imaging and digital health. He is at the forefront of imaging interoperability within the healthcare ecosystem, having been deeply involved with the development and implementation of DICOM, HL7, and IHE. Brad evangelizes the adoption and integration of seamless and turnkey healthcare workflows into everyday clinical practice at enterprise scale.

John Dzialo

John Dzialo is a Senior Solutions Architect at AWS for HCLS Startups with a passion for leveraging cutting-edge technologies to design and implement scalable cloud solutions. Driven by a keen interest in Kubernetes and ML/AI, he consistently applies these technologies to create intelligent, data-driven architectures that propel businesses forward.

Michael Zephyr

Michael Zephyr is a technical product manager for healthcare AI at NVIDIA. He leads both NVIDIA MONAI Services and the open-source MONAI Outreach Working Group. With a focus on enhancing annotation, fine-tuning, and training through API Services, Michael is dedicated to transforming how organizations use AI in healthcare. He sees API Services as the essential tool for enterprises to seamlessly adopt cutting-edge solutions.

Paulo Gallotti

Paulo Gallotti is a Senior Solutions Architect for Healthcare and Life Sciences (HCLS) Startups at AWS, where he helps innovative companies leverage cloud technology to transform healthcare delivery. With 20+ years of industry experience, including 8 years specializing in AI/ML applications in healthcare, he combines deep technical expertise with a passion for building transformative solutions.