Help with Parabricks performance on a SLURM HPC cluster using Snakemake

Hi,

I am trying to align Illumina 30x paired-end FASTQ files to the CHM13v2.0 (T2T) reference using Parabricks v4.4.0. According to NVIDIA and some published benchmarks, it should be possible to analyze 30x data in under an hour. However, when I run it on my university’s HPC cluster, it takes far longer than that and the GPUs appear to be underutilized.

  • I submit the jobs with Snakemake’s SLURM executor plugin and run them in the Parabricks Docker container through Snakemake’s Apptainer integration (see below).
  • I followed NVIDIA’s guide for best performance for fq2bam.
  • I request 4 A100 GPUs, 196GB CPU memory, and 32 CPU threads. According to this benchmark and this one too, 4 A100 GPUs can run the germline pipeline in a little over an hour.

However, when I run fq2bam and come back after 3-5 hours, it is still in Sorting Phase-I. I also received a warning from the HPC admin about GPU underutilization (see attached image - is this normal?)
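
To put numbers on the underutilization, per-GPU utilization could be sampled from inside the running job. The following is only a minimal sketch, assuming nvidia-smi is on the PATH of the compute node (or inside the container); the function names and the 10% threshold are illustrative choices, and the parser assumes nvidia-smi reports plain integers for every field.

```python
import csv
import io
import subprocess

# Standard nvidia-smi query fields; the output is one CSV row per GPU.
QUERY = "index,utilization.gpu,memory.used"


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, util, mem = (f.strip() for f in fields)
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def idle_gpus(rows, threshold_pct=10):
    """Return indices of GPUs below the (arbitrary) utilization threshold."""
    return [r["index"] for r in rows if r["util_pct"] < threshold_pct]


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_gpu_rows(out)
```

Calling `idle_gpus(sample_gpus())` every few seconds while fq2bam is in the alignment phase would show whether all four GPUs are busy, or only some of them.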

Any idea why I am not getting the expected performance, and what could be making the run take so long?


Job rule:

rule run_parabricks_alignment:
    input:
        fq_1=f"{work_dir}/inputs/{{sample}}_R1.fastq.gz",
        fq_2=f"{work_dir}/inputs/{{sample}}_R2.fastq.gz",
    output:
        f"{work_dir}/outputs/{{sample}}_markdup.bam",
    log:
        f"{work_dir}/logs/run_parabricks_alignment/{{sample}}.log",
    container:
        "docker://nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"
    params:
        reference=f"{work_dir}/resources/reference/chm13v2.0.fa",
        pixel_distance=2500,
        out_dir=f"{work_dir}/outputs",
    shell:
        "(pbrun fq2bam "
        "--ref {params.reference} "
        "--in-fq {input.fq_1} {input.fq_2} "
        "--out-bam {output} "
        "--out-duplicate-metrics {params.out_dir}/{wildcards.sample}_duplicate-metrics.txt "
        "--out-qc-metrics-dir {params.out_dir}/{wildcards.sample}_qc-metrics "
        f"--tmp-dir {work_dir}/tmp "
        "--bwa-options='-M' "
        "--fix-mate "
        "--optical-duplicate-pixel-distance {params.pixel_distance} "
        "--gpusort --gpuwrite "
        ")2> {log}"

Snakemake parameters:

jobs: 10
executor: slurm
use-conda: true
use-apptainer: true

set-resources:
  run_parabricks_alignment:
    slurm_extra: "'--gpus=4' '--cpus-per-task=32' '--qos=normal'"
    slurm_partition: "gpu-a100"
    mem: "196GB"
    time: 480
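
Because the GPU and CPU counts above are passed through slurm_extra, it may be worth confirming from inside the job that the allocation matches the request. The following is a minimal sketch, assuming the usual SLURM environment variables (SLURM_CPUS_PER_TASK, SLURM_MEM_PER_NODE) and CUDA_VISIBLE_DEVICES are visible in the job environment; the function names are hypothetical, and reading "196GB" as 196000 MB is an assumption.

```python
import os


def summarize_allocation(env=os.environ):
    """Collect what the job was actually granted, as seen from inside it."""
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return {
        # Falls back to 0 when the variable is absent from the environment.
        "cpus_per_task": int(env.get("SLURM_CPUS_PER_TASK", 0)),
        # Memory per node in MB, when a memory request was applied.
        "mem_mb": int(env.get("SLURM_MEM_PER_NODE", 0)),
        # Count the comma-separated device entries (indices or UUIDs).
        "gpus_visible": len([d for d in visible.split(",") if d]),
    }


def shortfalls(granted, wanted):
    """Return {resource: (granted, wanted)} for every under-allocated resource."""
    return {
        key: (granted.get(key, 0), wanted[key])
        for key in wanted
        if granted.get(key, 0) < wanted[key]
    }


# Values requested in the profile above (196000 MB is an assumed conversion).
WANTED = {"cpus_per_task": 32, "mem_mb": 196000, "gpus_visible": 4}
```

Printing `shortfalls(summarize_allocation(), WANTED)` at the start of the job would return an empty dict if everything was granted as requested; any entry it lists points at a resource that did not reach the job.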