AWS Public Sector Blog
Operationalizing SAS on AWS with the American College of Radiology
The American College of Radiology (ACR) is a nonprofit organization and world leader supporting the radiology community. Founded in 1923, ACR represents more than 41,000 diagnostic and interventional radiologists, radiation oncologists, nuclear medicine physicians, and medical physicists.
In this post, we share how ACR implemented a Statistical Analysis System (SAS) solution on Amazon Web Services (AWS) to meet their big data processing and analysis needs, resulting in cost optimization, performance improvements, and better scalability.
The on-premises solution and its challenges
One of ACR’s core tasks is ingesting, processing, delivering, and reporting imaging and clinical radiology datasets. Before implementing the solution on AWS, ACR managed an on-premises solution consisting of a large SAS server and a storage area network (SAN). This solution was sufficient for processing about 3 terabytes (TB) of data within 6–8 hours. However, the ACR team gained access to more than 100 TB of additional data that needed to be ingested, cataloged, and analyzed, and thus needed a more scalable solution.
To handle this growth in volume on premises, ACR would have needed to add physical servers and SAN capacity to their existing infrastructure, as shown in Figure 1, increasing both the cost and operational overhead of the solution. Their data analysis needs consisted of two processes: cohort creation and analysis. The cohort creation process requires all data to be scanned longitudinally and demands significant compute resources; it is typically done up to five times per project, per analyst. The ongoing analysis process is much less compute-intensive than cohort creation. This means that in an on-premises solution, ACR would use their full processing capacity for cohort creation only a few times per month. The rest of the month, they would be over-provisioned.
Given these scaling considerations and a need to optimize costs, ACR worked with AWS to design and implement a cloud-based SAS solution.
Solution overview
ACR’s solution on AWS relies on several core storage and analytics services. As the entry point to the architecture, data is uploaded as .csv files into an Amazon Simple Storage Service (Amazon S3) bucket for raw data. An AWS Glue crawler then creates a catalog of the data to support analytics processes. After the data is cataloged, an AWS Glue job transforms the raw .csv data into Parquet, a columnar format that is more efficient for analytics. ACR’s Neiman Health Policy Institute team converted the scripts they had been using on premises to SQL queries and used Amazon Athena to run serverless queries on the data.
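The following is a minimal sketch of this CSV-to-Parquet conversion step as an AWS Glue (PySpark) job. The catalog database, table, and bucket names are hypothetical placeholders, not ACR’s actual resources.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV data through the table the AWS Glue crawler created.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="acr_raw_db",       # hypothetical Data Catalog database
    table_name="raw_csv_data",   # hypothetical crawler-created table
)

# Write the data back to Amazon S3 as Parquet for efficient analytics.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://acr-processed-data/parquet/"},  # hypothetical bucket
    format="parquet",
)

job.commit()
```

Converting to Parquet once at ingest means every downstream Athena query benefits from the compressed, columnar layout.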
Since the team was using Athena for the first time, they were concerned about scanning too much data and incurring unexpected costs. To address this, they created three workgroups in Athena based on the team’s query requirements. The default workgroup enforced a 5-gigabyte (GB) per-query data scan limit to prevent accidentally running large, costly queries.
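As a sketch of how such a guardrail can be configured, the following boto3 call creates an Athena workgroup with a 5 GB per-query scan cutoff. The workgroup name and results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="default-analysts",  # hypothetical workgroup name
    Description="Workgroup with a 5 GB per-query scan limit",
    Configuration={
        # Athena cancels any query in this workgroup that scans more
        # than 5 GB (the cutoff is specified in bytes).
        "BytesScannedCutoffPerQuery": 5 * 1024**3,
        "ResultConfiguration": {
            "OutputLocation": "s3://acr-athena-results/"  # hypothetical bucket
        },
        # Keep individual queries from overriding the workgroup settings.
        "EnforceWorkGroupConfiguration": True,
    },
)
```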
Due to the large scale of ACR’s data, the AWS Glue jobs initially failed with the error message, “no space left on disk.” Partitioning and bounded execution allowed ACR’s AWS Glue jobs to scale without error. Bounded execution helped them process hundreds of terabytes of data seamlessly by limiting the number of files AWS Glue would process during each job run.
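Bounded execution is configured when reading the source data, as in the sketch below. It assumes the same GlueContext setup and hypothetical names as the earlier job; the specific bounds are illustrative, and job bookmarks must be enabled so that successive runs resume where the previous one stopped.

```python
# Cap how much data a single AWS Glue job run processes.
bounded = glue_context.create_dynamic_frame.from_catalog(
    database="acr_raw_db",        # hypothetical Data Catalog database
    table_name="raw_csv_data",    # hypothetical crawler-created table
    additional_options={
        "boundedFiles": "50000",  # process at most 50,000 files per run
        # Or cap by size instead:
        # "boundedSize": "107374182400",  # ~100 GB, in bytes
    },
    transformation_ctx="bounded_read",  # needed for job bookmarks
)
```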
The following architecture diagram illustrates the workflow to implement the solution.
Benefits of the new SAS solution
Migrating the workload from on-premises to AWS Cloud services provided cost savings, better performance, and additional flexibility and scalability for ACR. Because AWS services use a pay-as-you-go pricing model, ACR avoided costly on-premises hardware upgrades, a savings they estimate at hundreds of thousands of dollars.
Using AWS Glue to convert the data to Parquet format compressed it almost tenfold, substantially reducing storage costs. Overall, ACR benefited from a threefold reduction in total cost compared to their previous on-premises solution.
By implementing this compression strategy and integrating Amazon Athena for querying, ACR experienced faster query performance because Parquet’s columnar structure lets a query read only the columns it needs. This design made queries almost a hundred times faster, reducing the most intensive query times from 1.5 hours to less than a minute.
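To illustrate, the following boto3 sketch runs a column-pruned query through the scan-limited workgroup created earlier. The database, table, and column names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Selecting only the needed columns lets Athena read just those column
# chunks from the Parquet files instead of scanning whole rows.
response = athena.start_query_execution(
    QueryString="""
        SELECT patient_id, study_date, modality
        FROM radiology_db.imaging_parquet
        WHERE study_date >= DATE '2020-01-01'
    """,
    WorkGroup="default-analysts",  # scan-limited workgroup from earlier
)
print("Started query:", response["QueryExecutionId"])
```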
Conclusion
The American College of Radiology’s migration from an on-premises SAS infrastructure to AWS substantially reduced costs while increasing system performance. ACR hopes that their experience in migrating SAS to AWS can serve as a guide for other organizations seeking similar results.
In this post, we detailed Phase 1 of ACR’s journey in modernizing their analytics processes. Going forward, they plan to implement a Phase 2 strategy. This new phase will continue to use AWS Glue for the filtering and cohort-creation stage, while exploring R and using the Amazon SageMaker R kernel to access the filtered data through AWS APIs. Overall, ACR anticipates that Phase 2 has the potential to achieve a fivefold cost reduction beyond the savings they have already achieved.