AWS HPC Blog
Benchmarking NVIDIA Clara Parabricks Somatic Variant Calling Pipeline on AWS
This post was contributed by Pankaj Vats, PhD, and Timothy Harkins, PhD, from NVIDIA Parabricks.
Introduction
Advances in Next Generation Sequencing (NGS) technologies over the last decade have led to many novel discoveries in cancer genomics [1,2,3]. Whole genome, whole exome and targeted-panel sequencing are more widely used today in clinical research and practice, to diagnose cancers and identify the course of treatment for cancer patients based on tumor mutations. This has led to the formation of consortia dedicated to studying cancer genomics at a large scale [3,5]. As the volume of data increases, the algorithms and workflows used to analyze the data are proving to be the primary bottleneck. NVIDIA Clara Parabricks addresses this issue by providing an accelerated, scalable, and reproducible software suite optimized for genomic analysis on Graphics Processing Units (GPUs).
In a previous blog post, we discussed the germline variant calling tools available in Clara Parabricks. In this post, we will provide an overview of how to perform somatic variant analysis for cancer workflows using Parabricks from the AWS Marketplace. Somatic variants are genetic alterations which are not inherited but acquired during one’s lifespan, for example those that are present in a tumor. In this post, we will demonstrate how to perform somatic variant calling from matched tumor and normal, as well as tumor-only whole genome and whole exome datasets using an NVIDIA GPU-accelerated Parabricks pipeline and compare the results with baseline CPU-based workflows.
Somatic variant calling pipeline overview
The Parabricks somatic variant workflow takes tumor and normal BAM files as input for five state-of-the-art somatic variant calling tools: Mutect2, MuSE, Somatic Sniper, and LoFreq, which are GPU-accelerated in the Clara Parabricks software suite, and Strelka2, which is not. The VCF output of these callers can be used to generate a consensus VCF or an intersection VCF for downstream secondary analysis. Parabricks also lets users supply a set of normal samples to create a panel of normals (PON). Mutect2 can be run in tumor-only mode, with or without a PON.
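The consensus and intersection step described above comes down to set logic over variant keys. The following is a minimal Python sketch of that idea, not the Parabricks implementation. It assumes each caller's VCF has already been reduced to a set of (chrom, pos, ref, alt) tuples, and the function name `combine_calls`, the `min_callers` threshold, and the example variants are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def combine_calls(calls: Dict[str, Set[Variant]], min_callers: int) -> Set[Variant]:
    """Return the variants reported by at least `min_callers` callers."""
    if not 1 <= min_callers <= len(calls):
        raise ValueError("min_callers must be between 1 and the number of callers")
    # Each caller contributes a set, so a variant is counted once per caller.
    support = Counter(v for variants in calls.values() for v in variants)
    return {v for v, n in support.items() if n >= min_callers}


# Made-up calls for illustration only.
calls = {
    "mutect2": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "muse": {("chr1", 100, "A", "T")},
    "lofreq": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")},
}

intersection = combine_calls(calls, min_callers=len(calls))  # reported by every caller
consensus = combine_calls(calls, min_callers=2)              # reported by at least two callers
```

Requiring every caller gives the strictest call set, while a lower threshold trades some precision for sensitivity. A real workflow would also need to normalize variant representations before comparing keys.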
Dataset used
In this post, we will use the ~50x Tumor/Normal short-read human data from the Somatic Mutation Working Group of the SEQC2 consortium [6-9]. The dataset consists of both whole genome and whole exome data for HCC1395 (a triple-negative breast cancer cell line) and HCC1395 BL (the matched normal cell line), sequenced on multiple platforms with multiple technical replicates. The WGS data had ~55x mean coverage for tumor and normal, and the WES data had ~100x for the tumor sample and 80x for the normal sample. We chose this dataset because it is well characterized, and the consortium provides a truth set for the somatic variant calls.
Running NVIDIA Clara Parabricks Somatic Workflow
We detailed how to subscribe to and use the NVIDIA Clara Parabricks AMI from the AWS Marketplace in the previous post on benchmarking the germline pipeline. We followed the same prerequisites and initial setup from that post up until “Step 3. Download data”, where we switched to the resources for the somatic pipeline benchmark.
We ran four variant callers (Mutect2, Somatic Sniper, MuSE, and LoFreq) on the SEQC2 dataset. The command line options to download the data and run each of these variant callers can be found at this GitHub repository, which the Clara Parabricks team has provided for this blog post. You can also find a more detailed overview of the somatic variant calling tools available in Clara Parabricks within the NVIDIA documentation.
Results
Runtime comparison of baseline CPU algorithms versus GPU-accelerated NVIDIA Clara Parabricks variant callers
The NVIDIA Clara Parabricks software can run on different NVIDIA GPU instances such as the Amazon EC2 G4, P3, and P4 families, but here we focus on Parabricks runs on a g4dn.12xlarge instance (4 NVIDIA T4 Tensor Core GPUs), compared with the baseline callers run on an m5.8xlarge (32 vCPU) instance.
A runtime comparison of Parabricks versus the CPU-based baseline callers on the SEQC2 WGS data shows a significant speedup: 4x for LoFreq, 6.5x for MuSE, 9x for Somatic Sniper, and 42x for Mutect2 (tumor/normal mode). The detailed runtime for each caller is shown in Figure 2. For tumor-only somatic variant calling, Parabricks provides 56x acceleration relative to the Mutect2 baseline caller (Figure 2).
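The speedup factors quoted here are ratios of baseline runtime to accelerated runtime. As a minimal sketch of that arithmetic, the Python helper below computes the ratio. The function name `speedup` is hypothetical, and the runtimes are placeholders chosen only to illustrate the calculation, not measurements from this benchmark.

```python
def speedup(baseline_minutes: float, accelerated_minutes: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if baseline_minutes <= 0 or accelerated_minutes <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_minutes / accelerated_minutes


# Placeholder runtimes (minutes), not measured values.
example = speedup(baseline_minutes=420.0, accelerated_minutes=10.0)
print(f"{example:.1f}x")
```

Because the two runs use different instance types, the ratio compares end-to-end wall-clock time per sample rather than per-core efficiency.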
Concordance analysis
We assessed the concordance of the output VCFs generated by the NVIDIA Clara Parabricks somatic callers with those of the baseline somatic callers using the GATK Concordance tool. SNV recall and precision were evaluated using the baseline caller output VCF files as the truth set and the VCF files generated by NVIDIA Clara Parabricks as the evaluation set (Figure 3). The concordance (F1-score) was 100% for MuSE, LoFreq, and Somatic Sniper, followed by 99.99% for Mutect2 tumor-only mode and 99.96% for Mutect2 tumor/normal mode. Similarly, for indels we achieved F1-scores of 99.99% and 99.94% for Mutect2 tumor-only mode and Mutect2 tumor/normal mode, respectively (Figure 4). The difference in the Mutect2 results is due to the underlying algorithm, which uses random number generation and therefore produces non-deterministic output. In contrast, NVIDIA Clara Parabricks is focused on providing more deterministic results across different hardware platforms.
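For readers less familiar with these metrics, precision, recall, and F1-score follow directly from counting shared and unshared calls between the truth set and the evaluation set. The Python sketch below shows that calculation under simplifying assumptions: variants are compared as exact (chrom, pos, ref, alt) tuples, the function name `concordance` is hypothetical, and the example variants are made up. It illustrates the metrics only and is not a substitute for the GATK Concordance tool.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def concordance(truth: Set[Variant], evaluated: Set[Variant]) -> Dict[str, float]:
    """Compute precision, recall, and F1 of `evaluated` against `truth`."""
    tp = len(truth & evaluated)   # calls present in both sets
    fp = len(evaluated - truth)   # calls only in the evaluated set
    fn = len(truth - evaluated)   # calls only in the truth set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up variants for illustration only.
truth = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "A"),
         ("chr2", 300, "C", "G"), ("chr3", 400, "T", "C")}
evaluated = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "A"),
             ("chr2", 300, "C", "G"), ("chr4", 500, "A", "G")}

metrics = concordance(truth, evaluated)
```

In this toy example three of four calls are shared, so precision, recall, and F1 are all 0.75.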
Cost of running somatic variant calling using NVIDIA Clara Parabricks
In addition to acceleration and accuracy, NVIDIA Clara Parabricks also provides the advantage of minimizing the processing cost per sample. Overall, Parabricks demonstrates a reduction in cost ranging from 2-fold for LoFreq to 16-fold for the Mutect2 tumor-only run. Being faster, highly concordant with the corresponding baseline variant callers, and significantly cheaper on a per-sample basis makes Parabricks more feasible for large-scale exome/genome studies (Figure 5).
All the WGS/WES analysis was performed in the AWS us-east-1 Region; pricing in other Regions may vary. The costs below are based on EC2 instance runtimes (billed per second), and do not include the NVIDIA Parabricks AMI license cost (billed per hour). As of the time of publishing, the license cost is $0.30/hour in all Regions. Including licensing, Clara Parabricks is still significantly cheaper to use. For example, the all-up cost of running MuSE is $15.03, compared to an unaccelerated cost of more than $37.
Similarly, the runtime, performance, and cost estimation numbers for the exome dataset are represented in Figure 6, Figure 7 and Figure 8.
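The all-up cost of a run combines per-second instance billing with the hourly license fee. The Python sketch below is one way to estimate it under stated assumptions: the function name `run_cost` is hypothetical, the instance price is passed in rather than hard-coded because it varies by Region and over time, the license is assumed to be charged per started hour, and any minimum billing increment is ignored. The example inputs are placeholders, not figures from this benchmark.

```python
import math


def run_cost(runtime_seconds: float, instance_usd_per_hour: float,
             license_usd_per_hour: float = 0.30) -> float:
    """Estimate the all-up cost of one run in USD."""
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    hours = runtime_seconds / 3600.0
    compute = hours * instance_usd_per_hour  # instance time, billed per second
    # Assumption: the license is charged for each started hour.
    license_fee = math.ceil(hours) * license_usd_per_hour
    return compute + license_fee


# Placeholder inputs: a 90-minute run on an instance priced at $4.00/hour.
total = run_cost(runtime_seconds=5400, instance_usd_per_hour=4.00)
print(f"${total:.2f}")
```

To compare against a CPU baseline, call the same function with the baseline runtime and instance price and `license_usd_per_hour=0`.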
Conclusion
The Next Generation Sequencing (NGS) bioinformatics workflows for somatic variant calling using baseline callers are slow and have become the computational bottleneck as the sequencing throughput increases. Here, NVIDIA Clara Parabricks takes advantage of NVIDIA’s GPU architecture and provides a comprehensive, fast, and cost-effective somatic variant analysis solution. Parabricks provides 4x to 56x acceleration compared to the baseline callers. Parabricks’ somatic variant calling pipeline exhibits a huge advantage over the conventional baseline pipeline by providing a faster, scalable, reproducible, and cost-effective solution without compromising on the accuracy in comparison to standard pipelines. We will continue to expand the bioinformatics tools in Parabricks. NVIDIA provides detailed documentation on Parabricks at its website. To get started using Parabricks on AWS, please visit the offering in AWS Marketplace.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.
References:
- Wu YM, Su F, Kalyana-Sundaram S, Khazanov N, Ateeq B, Cao X, Lonigro RJ, Vats P, Wang R, Lin SF, Cheng AJ, Kunju LP, Siddiqui J, Tomlins SA, Wyngaard P, Sadis S, Roychowdhury S, Hussain MH, Feng FY, Zalupski MM, Talpaz M, Pienta KJ, Rhodes DR, Robinson DR, Chinnaiyan AM. Identification of targetable FGFR gene fusions in diverse cancers. Cancer Discov. 2013 Jun;3(6):636-47. doi: 10.1158/2159-8290.CD-13-0050. Epub 2013 Apr 4. PMID: 23558953; PMCID: PMC3694764.
- Robinson D, Van Allen EM, Wu YM, Schultz N, Lonigro RJ, Mosquera JM, Montgomery B, Taplin ME, Pritchard CC, Attard G, Beltran H, Abida W, Bradley RK, Vinson J, Cao X, Vats P, Kunju LP, Hussain M, Feng FY, Tomlins SA, Cooney KA, Smith DC, Brennan C, Siddiqui J, Mehra R, Chen Y, Rathkopf DE, Morris MJ, Solomon SB, Durack JC, Reuter VE, Gopalan A, Gao J, Loda M, Lis RT, Bowden M, Balk SP, Gaviola G, Sougnez C, Gupta M, Yu EY, Mostaghel EA, Cheng HH, Mulcahy H, True LD, Plymate SR, Dvinge H, Ferraldeschi R, Flohr P, Miranda S, Zafeiriou Z, Tunariu N, Mateo J, Perez-Lopez R, Demichelis F, Robinson BD, Schiffman M, Nanus DM, Tagawa ST, Sigaras A, Eng KW, Elemento O, Sboner A, Heath EI, Scher HI, Pienta KJ, Kantoff P, de Bono JS, Rubin MA, Nelson PS, Garraway LA, Sawyers CL, Chinnaiyan AM. Integrative clinical genomics of advanced prostate cancer. Cell. 2015 May 21;161(5):1215-1228. doi: 10.1016/j.cell.2015.05.001. Erratum in: Cell. 2015 Jul 16;162(2):454. PMID: 26000489; PMCID: PMC4484602.
- Hutter C, Zenklusen JC. The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell. 2018 Apr 5;173(2):283-285. doi: 10.1016/j.cell.2018.03.042. PMID: 29625045.
- Heimlich JB, Bick AG. Somatic Mutations in Cardiovascular Disease. Circ Res. 2022 Jan 7;130(1):149-161. doi: 10.1161/CIRCRESAHA.121.319809. Epub 2022 Jan 7. PMID: 34995138.
- ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature. 2020 Feb;578(7793):82-93. doi: 10.1038/s41586-020-1969-6. Epub 2020 Feb 5. PMID: 32025007; PMCID: PMC7025898.
- SEQC2 Consortium data: https://sites.google.com/view/seqc2/home
- https://docs.nvidia.com/clara/parabricks/v3.6/text/publications_list.html
- Fang LT, Zhu B, Zhao Y, Chen W, Yang Z, Kerrigan L, Langenbach K, de Mars M, Lu C, Idler K, et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nature Biotechnology. 2021;39(9):1151-1160. PMID: 34504347.
- Xiao W, Ren L, Chen Z, Fang LT, Zhao Y, Lack J, Guan M, Zhu B, Jaeger E, Kerrigan L, et al. Toward best practice in cancer mutation detection with whole-genome and whole-exome sequencing. Nature Biotechnology. 2021;39(9):1141-1150. PMID: 34504346.
- Zhao Y, Fang LT, Shen T, Choudhari S, Talsania K, Chen X, Shetty J, Kriga Y, Tran B, Zhu B, et al. Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study. Scientific Data. 2021;8(1):296