Hi,
I am trying to align 30x Illumina paired-end FASTQ files to the CHM13v2.0 (T2T) reference using Parabricks v4.4.0. According to NVIDIA and several published benchmarks, 30x data should be analyzable in under an hour. However, when I run it on my university's HPC, it takes far longer than that, and the GPUs appear to be underutilized.
- I am using Snakemake's SLURM executor plugin to submit the jobs, with Snakemake's Apptainer integration running the rule inside the Parabricks container image (see below).
- I followed NVIDIA’s guide for best performance for fq2bam.
- I request 4 A100 GPUs, 196 GB of system memory, and 32 CPU threads. Both this benchmark and this one report the full germline pipeline finishing in a little over an hour on 4 A100 GPUs.
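For completeness, here is the sanity check I can run to confirm the allocation actually exposes all four GPUs and 32 CPUs to the job (a sketch; it assumes `nvidia-smi` is on PATH on the compute node and that an interactive `srun` is permitted on this partition):

```shell
# Request the same resources interactively on the same partition,
# then print what the job actually sees.
srun --partition=gpu-a100 --gpus=4 --cpus-per-task=32 --qos=normal --pty bash -c '
  nvidia-smi --query-gpu=index,name,utilization.gpu --format=csv  # should list 4 GPUs
  nproc                                                           # should print 32
'
```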
However, when I launch fq2bam and come back 3-5 hours later, it is still in Sorting Phase-I. I have also received a warning from the HPC admins about GPU underutilization (see attached image; is this normal?).
Any idea why I am not seeing the expected performance? What could make fq2bam take this long with these resources?
Job rule:
rule run_parabricks_alignment:
    input:
        fq_1=f"{work_dir}/inputs/{{sample}}_R1.fastq.gz",
        fq_2=f"{work_dir}/inputs/{{sample}}_R2.fastq.gz",
    output:
        f"{work_dir}/outputs/{{sample}}_markdup.bam",
    log:
        f"{work_dir}/logs/run_parabricks_alignment/{{sample}}.log",
    container:
        "docker://nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"
    params:
        reference=f"{work_dir}/resources/reference/chm13v2.0.fa",
        pixel_distance=2500,
        out_dir=f"{work_dir}/outputs",
    shell:
        "(pbrun fq2bam "
        "--ref {params.reference} "
        "--in-fq {input.fq_1} {input.fq_2} "
        "--out-bam {output} "
        "--out-duplicate-metrics {params.out_dir}/{wildcards.sample}_duplicate-metrics.txt "
        "--out-qc-metrics-dir {params.out_dir}/{wildcards.sample}_qc-metrics "
        f"--tmp-dir {work_dir}/tmp "
        "--bwa-options='-M' "
        "--fix-mate "
        "--optical-duplicate-pixel-distance {params.pixel_distance} "
        "--gpusort --gpuwrite "
        ") 2> {log}"
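In case it is easier to read than the Snakemake string concatenation, this is my reconstruction of the command the rule renders for a hypothetical sample "S1" (with `work_dir` abbreviated to `$WD`); it is not copied from a log:

```shell
# Reconstructed pbrun invocation for a hypothetical sample "S1".
(pbrun fq2bam \
    --ref $WD/resources/reference/chm13v2.0.fa \
    --in-fq $WD/inputs/S1_R1.fastq.gz $WD/inputs/S1_R2.fastq.gz \
    --out-bam $WD/outputs/S1_markdup.bam \
    --out-duplicate-metrics $WD/outputs/S1_duplicate-metrics.txt \
    --out-qc-metrics-dir $WD/outputs/S1_qc-metrics \
    --tmp-dir $WD/tmp \
    --bwa-options='-M' \
    --fix-mate \
    --optical-duplicate-pixel-distance 2500 \
    --gpusort --gpuwrite \
) 2> $WD/logs/run_parabricks_alignment/S1.log
```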
Snakemake parameters:
jobs: 10
executor: slurm
use-conda: true
use-apptainer: true
set-resources:
  run_parabricks_alignment:
    slurm_extra: "'--gpus=4' '--cpus-per-task=32' '--qos=normal'"
    slurm_partition: "gpu-a100"
    mem: "196GB"
    time: 480
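As I understand it, the settings above should amount to a submission roughly equivalent to the following sbatch header. This is my own reconstruction of what the executor plugin requests, not the actual generated script, so please correct me if the plugin maps these options differently:

```shell
#!/bin/bash
#SBATCH --partition=gpu-a100
#SBATCH --gpus=4
#SBATCH --cpus-per-task=32
#SBATCH --qos=normal
#SBATCH --mem=196G
#SBATCH --time=480   # minutes
# The executor then invokes Snakemake for this one job, which runs the
# rule's shell command inside the Parabricks Apptainer container.
```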