Let's run germline pipeline + fq2bamfast on RTX 4090 24GB VRAM with 24 cores CPU and 64GB RAM

khamenya · June 7, 2024, 6:24am

First of all good news:

germline pipeline runs on RTX 4090 using plain invocation from docs!

It takes 4 minutes on Parabricks sample data from tutorial and
it takes 1 hour 40 minutes on DanteLabs 30x WGS (hooman being DNA).

But I fail to enable --fq2bamfast as a part of pbrun germline invocation. I tried many reasonable changes to the example invocation from docs to no avail.

Let’s start from minimal change (--run-partition is not possible with 1 GPU) to the example invocation from docs:

#!/usr/bin/bash -x

REFERENCE_FILE=01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta
KNOWN_SITES_FILE=01-ref/assemblies/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
#INPUT_FASTQ_1=00-raw/me/GFXC959360_SA_L001_R1_001.fastq.gz
INPUT_FASTQ_1=00-raw/parabricks_sample/sample_1.fq.gz
INPUT_FASTQ_2=00-raw/parabricks_sample/sample_2.fq.gz
OUTPUT_BAM=parabricks-germline.bam
OUTPUT_VCF=parabricks-germline-variants.vcf
OUT_RECAL_FILE=parabricks-germline-recal.table

time docker run --rm --gpus all --volume /mnt/dna/DanteLabs/data:/workdir --volume /mnt/dna/DanteLabs/nvidia/parabrick_sample/:/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 \
    pbrun germline \
    --verbose \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --tmp-dir /workdir/tmp \
    --num-cpu-threads-per-stage 1 \
    --bwa-cpu-thread-pool 1 \
    --out-variants /outputdir/${OUTPUT_VCF} \
    --read-from-tmp-dir \
    --gpusort \
    --gpuwrite \
    --fq2bamfast \
    --keep-tmp

this gives out of memory error on CUDA side:

+ REFERENCE_FILE=01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta
+ KNOWN_SITES_FILE=01-ref/assemblies/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
+ INPUT_FASTQ_1=00-raw/parabricks_sample/sample_1.fq.gz
+ INPUT_FASTQ_2=00-raw/parabricks_sample/sample_2.fq.gz
+ OUTPUT_BAM=parabricks-germline.bam
+ OUTPUT_VCF=parabricks-germline-variants.vcf
+ OUT_RECAL_FILE=parabricks-germline-recal.table
+ docker run --rm --gpus all --volume /mnt/dna/DanteLabs/data:/workdir --volume /mnt/dna/DanteLabs/nvidia/parabrick_sample/:/outputdir --workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 pbrun germline --verbose --ref /workdir/01-ref/assemblies/hg38/Homo_sapi
ens_assembly38.fasta --in-fq /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz --out-bam /outputdir/parabricks-germline.bam --tmp-dir /workdir/tmp --num-cpu-threads-per-stage 1 --bwa-cpu-thread-pool 1 --out-variants /outputdir/parabricks-germline-variants.vcf --read-from-tmp-di
r --gpusort --gpuwrite --fq2bamfast --keep-tmp
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2024-Jun-07 06:19:58] ------------------------------------------------------------------------------
[PB Info 2024-Jun-07 06:19:58] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-07 06:19:58] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-07 06:19:58] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||
[PB Info 2024-Jun-07 06:19:58] ------------------------------------------------------------------------------
[PB Debug 2024-Jun-07 06:19:58][src/main.cpp:235] Read in 1 fq pairs
[PB Info 2024-Jun-07 06:19:58] Mode = pair-ended-gpu
[PB Info 2024-Jun-07 06:19:58] Running with 1 GPU(s), using 4 stream(s) per device with 1 worker threads per GPU
[PB Debug 2024-Jun-07 06:20:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-07 06:20:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation

[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-07 06:20:00][src/internal/seed_gpu.cu:535] [PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] [PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312

[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] [PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:134] Left/Right %d2720525312[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] [PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Phase 3 allocatedSize %d3989748800

Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 06:20:00][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] [PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664

Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-07 06:20:00][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Error 2024-Jun-07 06:20:00][src/internal/gpu_check_error.cu:22] cudaSafeCall() failed at src/internal/gpu_bwa_handle.cu/56: out of memory, exiting.
For technical support visit https://docs.nvidia.com/clara/index.html#parabricks
Exiting...


Could not run fq2bamfast as part of germline pipeline
Exiting pbrun ...

Any advice?

UPDATES:

if fq2bamfast invoked directly (stand alone) it succeeds with --bwa-nstreams 2 (with 3 it still fails)
so, I went back to pbrun germline with --bwa-nstreams 2:

...
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 07:13:45][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-07 07:13:45][src/internal/sam_gpu.cu:104] [PB Debug 2024-Jun-07 07:13:45][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
Phase 4 allocatedSize %d3989615664
[PB Error 2024-Jun-07 07:13:45][-unknown-:0] Received signal: 11

fq2bamfast also succeeds if started stand alone with --low-memory provided. Remarkably, this option overrides settings provided by --bwa-nstreams and forces it to be 1.
interestingly, --bwa-nstreams 2 runs the same fast as --low-memory (which forces --bwa-nstreams 1) 😳
neither --bwa-nstreams 1 nor --low-memory help to revive fq2bamfast within pbrun germline call.
btw, the peak performance (4.6G bp/minute) of fq2bamfast on my hardware (see topic title) is reached around/between --bwa-cpu-thread-pool 48 and 64. Surprisingly there is no difference between --bwa-nstreams 2 and 1

dpuleri · June 11, 2024, 1:55pm

That makes sense. A 4090 has too little GPU memory to support many streams. We aim to have 1 stream supported with 16GB of GPU memory to support GPUs like the T4.
It should be the same stand alone as it would be in the pipeline format. Can you give the full commands for your stand alone run and the one where you pass --bwa-nstreams # to germline pipeline? Can you also lower the --bwa-cpu-thread-pool to 16 or less? It looks like a segmentation fault and that would be in host memory so perhaps it is running out of host memory. Fewer threads can influence it to use less memory.
That is intended behavior as the number of streams is the largest contributor to GPU memory usage.
You would see more of a difference on datacenter GPUs like A100 or H100.

khamenya · June 11, 2024, 7:06pm

say,12 Gb per stream (or perhaps even just a bit less) would be much better for customers with RTX 4090 and its 24GB

or even better if it would have been a command line option! ;)

yes, sure. For example here is the invocation that takes 1m30s on sample data from tutorial:

docker run --rm --gpus all --volume ./data:/workdir --volume ./parabrick_sample/:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 \
    pbrun fq2bamfast \
    --verbose \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --bwa-nstreams 1 \
    --bwa-cpu-thread-pool 64\
    --gpuwrite \
    --gpusort

notice --bwa-cpu-thread-pool 64, values between 48 and 64 work just fine reaching the highest 6.1 Gbp/s throughput on the mentioned PC configuration.

and the other one that crashes:

docker run --rm --gpus all --volume ./data:/workdir --volume ./parabrick_sample/:/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 \
    pbrun germline \
    --verbose \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --tmp-dir /workdir/tmp \
    --num-cpu-threads-per-stage 1 \
    --bwa-cpu-thread-pool 1 \
    --bwa-nstreams 1 \
    --out-variants /outputdir/${OUTPUT_VCF} \
    --read-from-tmp-dir \
    --gpusort \
    --gpuwrite \
    --fq2bamfast \
    --low-memory \
    --keep-tmp

and the output log:

+ docker run --rm --gpus all --volume ./data:/workdir --volume ./parabrick_sample/:/outputdir --workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 pbrun germline --verbose --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta --in-fq /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz --out-bam /outputdir/parabricks-germline.bam --tmp-dir /workdir/tmp --num-cpu-threads-per-stage 1 --bwa-cpu-thread-pool 1 --bwa-nstreams 1 --out-variants /outputdir/parabricks-germline-variants.vcf --read-from-tmp-dir --gpusort --gpuwrite --fq2bamfast --low-memory --keep-tmp
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Jun-11 18:58:58] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 18:58:58] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 18:58:58] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 18:58:58] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||
[PB Info 2024-Jun-11 18:58:58] ------------------------------------------------------------------------------
[PB Debug 2024-Jun-11 18:58:58][src/main.cpp:235] Read in 1 fq pairs
[PB Info 2024-Jun-11 18:58:58] Mode = pair-ended-gpu
[PB Info 2024-Jun-11 18:58:58] Running with 1 GPU(s), using 1 stream(s) per device with 1 worker threads per GPU
[PB Debug 2024-Jun-11 18:59:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-11 18:59:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-11 18:59:01][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-11 18:59:01][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-11 18:59:01][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-11 18:59:01][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Error 2024-Jun-11 18:59:01][-unknown-:0] Received signal: 11

well, I’m a non-enterprise customer and my interests are private.
For a few runs of “DIY” WGS-analysis A100 or H100 is quite an overkill.

when --fq2bamfast will become running within germline pipeline, the whole pbrun germline invocation will take ~50 min on RTX 4090, which is quite nice.

dpuleri · June 11, 2024, 7:23pm

Could you try germline with --bwa-cpu-thread-pool 32? Two more things: take off --verbose and add --x3. The latter will show more details about the configuration of the run and verbose is adding too much noise right now.

khamenya · June 11, 2024, 7:38pm

@dpuleri woa! it ran successfully in 4m35s on sample data from tutorial, thank you!

btw, I guess it is a bug though and needs to be fixed :)

Now I’d try it on my real WGS data :)

dpuleri · June 11, 2024, 7:46pm

Awesome. Glad it worked out.

It is more a configuration issue. We try to pick the best presets to run on most systems fast, but it is tricky. I agree that we should give a nicer error other than a seg fault for that though since it was truly running out of memory.

Basically there needs to be a balance between the number of CPU threads and GPU resources as it is a big pipeline and if the CPU or GPU is too slow then queues can keep growing. We have limits but they may be too large by default for desktop systems.

khamenya · June 11, 2024, 7:49pm

btw when i tried --bwa-cpu-thread-pool 64 to reach the highest Gpb/s throughput it crashed only on the markdup phase:

+ REFERENCE_FILE=01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta
+ KNOWN_SITES_FILE=01-ref/assemblies/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
+ INPUT_FASTQ_1=00-raw/parabricks_sample/sample_1.fq.gz
+ INPUT_FASTQ_2=00-raw/parabricks_sample/sample_2.fq.gz
+ OUTPUT_BAM=parabricks-germline.bam
+ OUTPUT_VCF=parabricks-germline-variants.vcf
+ OUT_RECAL_FILE=parabricks-germline-recal.table
+ docker run --rm --gpus all --volume ./data:/workdir --volume ./nvidia/parabrick_sample/:/outputdir --workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 pbrun germline --x3 --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta --in-fq /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz --out-bam /outputdir/parabricks-germline.bam --tmp-dir /workdir/tmp --num-cpu-threads-per-stage 1 --bwa-cpu-thread-pool 64 --bwa-nstreams 1 --out-variants /outputdir/parabricks-germline-variants.vcf --read-from-tmp-dir --gpusort --gpuwrite --fq2bamfast --keep-tmp
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1

/usr/local/parabricks/run_pb.py fq2bamfast --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta --in-fq /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1 --out-bam /outputdir/parabricks-germline.bam --memory-limit 31 --keep-tmp --no-postsort --gpuwrite --gpusort --filter-flag 0 --bwa-nstreams 1 --bwa-cpu-thread-pool 64 --max-read-length 480 --min-read-length 10 --tmp-dir /workdir/tmp/LHUW0YJG --num-gpus 1 --x3

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
/usr/local/parabricks/binaries/bin/pbbwa /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta --mode pair-ended-gpu /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz -R @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1 --nGPUs 1 --nstreams 1 --cpu-thread-pool 64 -F 0 --min-read-size 10 --max-read-size 480 --markdups --write-bin
[PB Info 2024-Jun-11 19:41:32] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:41:32] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 19:41:32] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 19:41:32] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||
[PB Info 2024-Jun-11 19:41:32] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:41:32] Mode = pair-ended-gpu
[PB Info 2024-Jun-11 19:41:32] Running with 1 GPU(s), using 1 stream(s) per device with 64 worker threads per GPU
[PB Info 2024-Jun-11 19:41:42] # 100  0  1  0  0   0 pool:  0 618190770 bases/GPU/minute: 3709144620.0
[PB Info 2024-Jun-11 19:41:52] # 100  0  1  0  0   0 pool:  1 1409767833 bases/GPU/minute: 4749462378.0
[PB Info 2024-Jun-11 19:42:02] # 100  0  1  0  0   0 pool:  2 2186274024 bases/GPU/minute: 4659037146.0
[PB Info 2024-Jun-11 19:42:12] # 100  1  1  1  0   0 pool:  0 2970315321 bases/GPU/minute: 4704247782.0
[PB Info 2024-Jun-11 19:42:22] # 100  1  1  1  0   0 pool:  1 3754356400 bases/GPU/minute: 4704246474.0
[PB Info 2024-Jun-11 19:42:32] # 100  5  0  1  0   0 pool:  1 4530850485 bases/GPU/minute: 4658964510.0
[PB Info 2024-Jun-11 19:42:42] Time spent reading: 10.986321 seconds
[PB Info 2024-Jun-11 19:42:42] # 99  0  5  0  0   0 pool:  3 5329972596 bases/GPU/minute: 4794732666.0
[PB Info 2024-Jun-11 19:42:50] GPU 0 exited
[PB Info 2024-Jun-11 19:42:51] A CPU has exited
[PB Info 2024-Jun-11 19:42:52] #  0  0  0  0  0   0 pool: 100 6133353829 bases/GPU/minute: 4820287398.0
[PB Info 2024-Jun-11 19:43:02] Rate stats (based on sampling every 10 seconds):
	min rate: 3709144620.0 bases/GPU/minute
	max rate: 4820287398.0 bases/GPU/minute
	avg rate: 4600015371.8 bases/GPU/minute
[PB Info 2024-Jun-11 19:43:02] Time spent monitoring (multiple of 10): 90.005
[PB Info 2024-Jun-11 19:43:02] bwalib run finished in 81.059 seconds
[PB Info 2024-Jun-11 19:43:02] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:02] ||        Program:                    GPU-PBBWA mem, Sorting Phase-I        ||
[PB Info 2024-Jun-11 19:43:02] ||        Version:                                           4.3.1-1        ||
[PB Info 2024-Jun-11 19:43:02] ||        Start Time:                       Tue Jun 11 19:41:32 2024        ||
[PB Info 2024-Jun-11 19:43:02] ||        End Time:                         Tue Jun 11 19:43:02 2024        ||
[PB Info 2024-Jun-11 19:43:02] ||        Total Time:                            1 minute 30 seconds        ||
[PB Info 2024-Jun-11 19:43:02] ------------------------------------------------------------------------------
/usr/local/parabricks/binaries/bin/sort -sort_unmapped -ft 10 -gb 31 -gpu 1
[PB Info 2024-Jun-11 19:43:02] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:02] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 19:43:02] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 19:43:02] ||                             Sorting Phase-II                             ||
[PB Info 2024-Jun-11 19:43:02] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:02] progressMeter - Percentage
[PB Info 2024-Jun-11 19:43:02] 0.0
[PB Info 2024-Jun-11 19:43:07] Sorting and Marking: 5.001 seconds
[PB Info 2024-Jun-11 19:43:07] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:07] ||        Program:                                  Sorting Phase-II        ||
[PB Info 2024-Jun-11 19:43:07] ||        Version:                                           4.3.1-1        ||
[PB Info 2024-Jun-11 19:43:07] ||        Start Time:                       Tue Jun 11 19:43:02 2024        ||
[PB Info 2024-Jun-11 19:43:07] ||        End Time:                         Tue Jun 11 19:43:07 2024        ||
[PB Info 2024-Jun-11 19:43:07] ||        Total Time:                                      5 seconds        ||
[PB Info 2024-Jun-11 19:43:07] ------------------------------------------------------------------------------
/usr/local/parabricks/run_pb.py haplotypecaller --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta --in-bam /outputdir/parabricks-germline.bam --out-variants /outputdir/parabricks-germline-variants.vcf --ploidy 2 --num-htvc-threads 5 --read-from-tmp-dir --tmp-dir /workdir/tmp/LHUW0YJG --num-gpus 1 --x3
/usr/local/parabricks/binaries/bin/postsort /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta -o /outputdir/parabricks-germline.bam -sort_unmapped -gpuwrite -ft 4 -zt 17 -bq 1 -ngpu 1 -gb 31

/usr/local/parabricks/scheduler.py /usr/local/parabricks/binaries/bin/htvc /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta /outputdir/parabricks-germline.bam 1 -o /outputdir/parabricks-germline-variants.vcf -nt 5 -read-tmp
[PB Info 2024-Jun-11 19:43:08] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:08] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 19:43:08] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 19:43:08] ||                         Marking Duplicates, BQSR                         ||
[PB Info 2024-Jun-11 19:43:08] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:08] CuBamWriter using CUDA device 0
[PB Info 2024-Jun-11 19:43:08] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:08] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 19:43:08] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 19:43:08] ||                         GPU-GATK4 HaplotypeCaller                        ||
[PB Info 2024-Jun-11 19:43:08] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 19:43:08] Using PBCuBamWriter for BAM writing (algo 0)
[PB Info 2024-Jun-11 19:43:08] progressMeter -	Percentage
[PB Info 2024-Jun-11 19:43:15] 0 /outputdir/parabricks-germline.bam/outputdir/parabricks-germline-variants.vcf
[PB Info 2024-Jun-11 19:43:15] ProgressMeter -	Current-Locus	Elapsed-Minutes	Regions-Processed	Regions/Minute
[PB Error 2024-Jun-11 19:43:15][src/assembleRegions_GPU.cu:211] cudaSafeCall() failed: out of memory, exiting.

dpuleri · June 11, 2024, 7:54pm

Try without --read-from-tmp-dir. Excerpt from the docs below. It runs two stages in parallel and that requires more device memory.

Running variant caller reading from bin files generated by Aligner and sort. Run postsort in parallel. This option will increase device memory usage. (default: None)

khamenya · June 11, 2024, 8:04pm

--bwa-cpu-thread-pool 48 also works now, but it is a bit slower without --read-from-tmp-dir

khamenya · June 11, 2024, 8:22pm

interesting, the peak throughput on real WGS data from DanteLabs is 2.4 Gpb/s, which is visibly less than 4.5-6 Gbps on the sample data from tutorial.

khamenya · June 12, 2024, 6:12am

Dropping --bwa-cpu-thread-pool 48 to 32, 16 and even to 8 didn’t help.

Here is log tail of crash on Sorting Phase-II for 32:

...
[PB Info 2024-Jun-11 22:00:12] #  0 43  0  3  0   0 pool: 100 94639915903 bases/GPU/minute: 1960982994.0
[PB Info 2024-Jun-11 22:00:21] GPU 0 exited
[PB Info 2024-Jun-11 22:00:21] A CPU has exited
[PB Info 2024-Jun-11 22:00:22] #  0  0  0  0  0   0 pool: 100 94948513950 bases/GPU/minute: 1851588282.0
[PB Info 2024-Jun-11 22:00:32] Rate stats (based on sampling every 10 seconds):
	min rate: 1489209456.0 bases/GPU/minute
	max rate: 2235141594.0 bases/GPU/minute
	avg rate: 1802819885.1 bases/GPU/minute
[PB Info 2024-Jun-11 22:00:32] Time spent monitoring (multiple of 10): 3170.459
[PB Info 2024-Jun-11 22:00:32] bwalib run finished in 3161.453 seconds
[PB Info 2024-Jun-11 22:00:32] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:32] ||        Program:                    GPU-PBBWA mem, Sorting Phase-I        ||
[PB Info 2024-Jun-11 22:00:32] ||        Version:                                           4.3.1-1        ||
[PB Info 2024-Jun-11 22:00:32] ||        Start Time:                       Tue Jun 11 21:07:42 2024        ||
[PB Info 2024-Jun-11 22:00:32] ||        End Time:                         Tue Jun 11 22:00:32 2024        ||
[PB Info 2024-Jun-11 22:00:32] ||        Total Time:                          52 minutes 50 seconds        ||
[PB Info 2024-Jun-11 22:00:32] ------------------------------------------------------------------------------
/usr/local/parabricks/binaries/bin/sort -sort_unmapped -ft 10 -gb 31 -gpu 1
[PB Info 2024-Jun-11 22:00:34] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:34] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Jun-11 22:00:34] ||                              Version 4.3.1-1                             ||
[PB Info 2024-Jun-11 22:00:34] ||                             Sorting Phase-II                             ||
[PB Info 2024-Jun-11 22:00:34] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:34] progressMeter - Percentage
[PB Info 2024-Jun-11 22:00:34] 0.0
[PB Info 2024-Jun-11 22:00:39] 21.3
[PB Info 2024-Jun-11 22:00:44] 41.7
[PB Info 2024-Jun-11 22:00:49] 61.8
[PB Info 2024-Jun-11 22:00:54] 82.6
[PB Info 2024-Jun-11 22:00:59] 98.6
[PB Error 2024-Jun-11 22:01:01][/home/jenkins/agent/workspace/parabricks-branch-build//common/buffer.cuh:180] CUDA_CHECK() failed with out of memory (2), exiting.

sadly, reducing to 4 is not interesting anymore, because even with --bwa-cpu-thread-pool 16 we already having the performance as without --fq2bamfast: it crashed after 1 hour 14 minutes, whereas successful run without --fq2bamfast is in 1 hour 40 minutes.

also the throughput with --bwa-cpu-thread-pool 16 is about 1-1.2 Gbp/s most of the time, which is 5x to 6x dropdown if compared to stand-alone direct invocation of fq2bamfast --bwa-cpu-thread-pool 64 ... .

UPSHOT:
--fq2bamfast is not yet usable from within pbrun germline on RTX 4090, but is well usable and indeed high-performing in a stand-alone “manual” invocations as of clara-parabricks:4.3.1-1

dpuleri · June 12, 2024, 2:33pm

The throughput is highly dependent on the input data (e.g., what sequencer, read length, how many mutations) so it is not comparable between datasets.

Topic		Replies	Views
Could not run fq2bam as part of germline pipeline Parabricks	9	2200	April 8, 2022
Does the germline pipeline call through the ApplyBQSR process? Parabricks	3	789	July 9, 2021
Problem with gpu Parabricks ai	12	2508	November 1, 2024
"Could not run fq2bam" Is the only verbose output from Parabricks 4.4.0-1 and 4.3.2-1 on tutorial data Parabricks ai , demos-and-tutorials , fq2bam	15	205	March 3, 2025
PARABRICKS mem from pbrun germline command hanging and not finishing Parabricks	7	1487	July 5, 2022
Could not run fq2bam as part of germline pipeline (Version 4.0.1-1 ) Parabricks ai , nvidia-smi , fq2bam	11	161	December 9, 2024
Run parabricks and found cudaMemGetInfo returned 802 Parabricks	8	1589	January 13, 2022
[Nvidia/Parabricks] got an error on running Marking Duplicates (with the official Parabricks samples) Parabricks	5	1221	October 12, 2021
Fq2bam on GCP Parabricks ai , fq2bam	10	69	December 11, 2024
Clara-parabricks_4.1.0-1.sif can not recognize A100 cards? Parabricks ai	12	1148	July 2, 2024

Let's run germline pipeline + fq2bamfast on RTX 4090 24GB VRAM with 24 cores CPU and 64GB RAM

Related topics