The germline pipeline runs on an RTX 4090 using the plain invocation from the docs!
It takes 4 minutes on the Parabricks sample data from the tutorial and
1 hour 40 minutes on a DanteLabs 30x WGS (human DNA).
But I fail to enable --fq2bamfast as part of the pbrun germline invocation. I have tried many reasonable changes to the example invocation from the docs, to no avail.
Let’s start from a minimal change to the example invocation from the docs (--run-partition is not possible with 1 GPU).
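Roughly this kind of thing, i.e. the docs example with --fq2bamfast added (paths shortened here, not my exact command):

pbrun germline \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/sample_1.fq.gz /workdir/sample_2.fq.gz \
    --out-bam /outputdir/sample.bam \
    --out-variants /outputdir/sample.vcf \
    --fq2bamfast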
fq2bamfast also succeeds if started stand-alone with --low-memory provided. Remarkably, this option overrides the value given by --bwa-nstreams and forces it to 1.
Interestingly, --bwa-nstreams 2 runs just as fast as --low-memory (which forces --bwa-nstreams 1) 😳
Neither --bwa-nstreams 1 nor --low-memory helps to revive fq2bamfast within the pbrun germline call.
By the way, the peak performance (4.6 Gbp/minute) of fq2bamfast on my hardware (see the topic title) is reached around --bwa-cpu-thread-pool 48 to 64. Surprisingly, there is no difference between --bwa-nstreams 2 and 1; the kind of stand-alone invocation I am tuning is sketched below.
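For reference, the stand-alone runs look roughly like this (same container and volume boilerplate as in the full command further down; paths shortened):

pbrun fq2bamfast \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/wgs_1.fq.gz /workdir/wgs_2.fq.gz \
    --out-bam /outputdir/wgs.bam \
    --bwa-cpu-thread-pool 64 \
    --bwa-nstreams 2    # or drop this and pass --low-memory, which forces 1 stream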
That makes sense. A 4090 has too little GPU memory to support many streams. We aim to have 1 stream supported with 16GB of GPU memory to support GPUs like the T4.
It should be the same stand-alone as it is in the pipeline format. Can you give the full commands for your stand-alone run and the one where you pass --bwa-nstreams # to the germline pipeline? Can you also lower --bwa-cpu-thread-pool to 16 or less? It looks like a segmentation fault, and that would be in host memory, so perhaps it is running out of host memory. Fewer threads can influence it to use less memory.
That is intended behavior as the number of streams is the largest contributor to GPU memory usage.
You would see more of a difference on datacenter GPUs like A100 or H100.
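If you want to sanity-check the host-memory theory, you can watch host and GPU memory while the run is going; nothing Parabricks-specific, just standard Linux tools, for example:

# host RAM and swap, refreshed every 5 seconds
watch -n 5 free -h

# GPU memory, in a second terminal
watch -n 5 nvidia-smi --query-gpu=memory.used,memory.total --format=csv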
docker run --rm --gpus all \
    --volume ./data:/workdir --volume ./parabrick_sample/:/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 \
    pbrun germline --verbose \
    --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/00-raw/parabricks_sample/sample_1.fq.gz /workdir/00-raw/parabricks_sample/sample_2.fq.gz \
    --out-bam /outputdir/parabricks-germline.bam \
    --tmp-dir /workdir/tmp \
    --num-cpu-threads-per-stage 1 --bwa-cpu-thread-pool 1 --bwa-nstreams 1 \
    --out-variants /outputdir/parabricks-germline-variants.vcf \
    --read-from-tmp-dir --gpusort --gpuwrite --fq2bamfast --low-memory --keep-tmp
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Read group created for /workdir/00-raw/parabricks_sample/sample_1.fq.gz and
/workdir/00-raw/parabricks_sample/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Jun-11 18:58:58] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 18:58:58] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2024-Jun-11 18:58:58] || Version 4.3.1-1 ||
[PB Info 2024-Jun-11 18:58:58] || GPU-PBBWA mem, Sorting Phase-I ||
[PB Info 2024-Jun-11 18:58:58] ------------------------------------------------------------------------------
[PB Debug 2024-Jun-11 18:58:58][src/main.cpp:235] Read in 1 fq pairs
[PB Info 2024-Jun-11 18:58:58] Mode = pair-ended-gpu
[PB Info 2024-Jun-11 18:58:58] Running with 1 GPU(s), using 1 stream(s) per device with 1 worker threads per GPU
[PB Debug 2024-Jun-11 18:59:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-11 18:59:00][src/internal/bwa_lib_context.cu:214] handle is created and associated with GPU ID: 0
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-11 18:59:01][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-11 18:59:01][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:522] Using new-er seeding implementation
[PB Debug 2024-Jun-11 18:59:01][src/internal/seed_gpu.cu:535] Phase 1: allocatedSize %d3290643072
[PB Debug 2024-Jun-11 18:59:01][src/internal/chain_gpu.cu:41] Phase 2 allocated size 2455466512
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:119] Partial %d800262144
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:134] Left/Right %d2720525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:138] RMAX %d3040525312
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:145] Buf %d3706371072
[PB Debug 2024-Jun-11 18:59:01][src/internal/reg_gpu.cu:173] Phase 3 allocatedSize %d3989748800
[PB Debug 2024-Jun-11 18:59:01][src/internal/sam_gpu.cu:104] Phase 4 allocatedSize %d3989615664
[PB Error 2024-Jun-11 18:59:01][-unknown-:0] Received signal: 11
Well, I’m a non-enterprise customer and my interests are private.
For a few runs of “DIY” WGS analysis, an A100 or H100 is quite an overkill.
Once --fq2bamfast runs within the germline pipeline, the whole pbrun germline invocation should take ~50 min on the RTX 4090, which is quite nice.
Could you try germline with --bwa-cpu-thread-pool 32? Two more things: take off --verbose and add --x3. The latter will show more details about the configuration of the run, and --verbose is adding too much noise right now.
It is more of a configuration issue. We try to pick presets that run fast on most systems, but it is tricky. I agree that we should give a nicer error than a seg fault for this, though, since it was truly running out of memory.
Basically, there needs to be a balance between the number of CPU threads and GPU resources: it is a big pipeline, and if the CPU or GPU is too slow, queues can keep growing. We have limits, but they may be too large by default for desktop systems.
Try without --read-from-tmp-dir. It runs two stages in parallel, and that requires more device memory. Excerpt from the docs below:
Running variant caller reading from bin files generated by Aligner and sort. Run postsort in parallel. This option will increase device memory usage. (default: None)
Dropping --bwa-cpu-thread-pool from 48 to 32, 16, and even 8 didn’t help.
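For completeness, the 32 run reused my earlier invocation with the suggested tweaks: --verbose and --read-from-tmp-dir dropped, --x3 added (a sketch; the docker boilerplate is unchanged and the WGS input paths here are placeholders):

pbrun germline --x3 --fq2bamfast --low-memory \
    --ref /workdir/01-ref/assemblies/hg38/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/00-raw/wgs_1.fq.gz /workdir/00-raw/wgs_2.fq.gz \
    --out-bam /outputdir/germline.bam \
    --out-variants /outputdir/germline-variants.vcf \
    --tmp-dir /workdir/tmp \
    --bwa-cpu-thread-pool 32 --bwa-nstreams 1 \
    --gpusort --gpuwrite --keep-tmp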
Here is the log tail of the crash in Sorting Phase-II for 32:
...
[PB Info 2024-Jun-11 22:00:12] # 0 43 0 3 0 0 pool: 100 94639915903 bases/GPU/minute: 1960982994.0
[PB Info 2024-Jun-11 22:00:21] GPU 0 exited
[PB Info 2024-Jun-11 22:00:21] A CPU has exited
[PB Info 2024-Jun-11 22:00:22] # 0 0 0 0 0 0 pool: 100 94948513950 bases/GPU/minute: 1851588282.0
[PB Info 2024-Jun-11 22:00:32] Rate stats (based on sampling every 10 seconds):
min rate: 1489209456.0 bases/GPU/minute
max rate: 2235141594.0 bases/GPU/minute
avg rate: 1802819885.1 bases/GPU/minute
[PB Info 2024-Jun-11 22:00:32] Time spent monitoring (multiple of 10): 3170.459
[PB Info 2024-Jun-11 22:00:32] bwalib run finished in 3161.453 seconds
[PB Info 2024-Jun-11 22:00:32] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:32] || Program: GPU-PBBWA mem, Sorting Phase-I ||
[PB Info 2024-Jun-11 22:00:32] || Version: 4.3.1-1 ||
[PB Info 2024-Jun-11 22:00:32] || Start Time: Tue Jun 11 21:07:42 2024 ||
[PB Info 2024-Jun-11 22:00:32] || End Time: Tue Jun 11 22:00:32 2024 ||
[PB Info 2024-Jun-11 22:00:32] || Total Time: 52 minutes 50 seconds ||
[PB Info 2024-Jun-11 22:00:32] ------------------------------------------------------------------------------
/usr/local/parabricks/binaries/bin/sort -sort_unmapped -ft 10 -gb 31 -gpu 1
[PB Info 2024-Jun-11 22:00:34] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:34] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2024-Jun-11 22:00:34] || Version 4.3.1-1 ||
[PB Info 2024-Jun-11 22:00:34] || Sorting Phase-II ||
[PB Info 2024-Jun-11 22:00:34] ------------------------------------------------------------------------------
[PB Info 2024-Jun-11 22:00:34] progressMeter - Percentage
[PB Info 2024-Jun-11 22:00:34] 0.0
[PB Info 2024-Jun-11 22:00:39] 21.3
[PB Info 2024-Jun-11 22:00:44] 41.7
[PB Info 2024-Jun-11 22:00:49] 61.8
[PB Info 2024-Jun-11 22:00:54] 82.6
[PB Info 2024-Jun-11 22:00:59] 98.6
[PB Error 2024-Jun-11 22:01:01][/home/jenkins/agent/workspace/parabricks-branch-build//common/buffer.cuh:180] CUDA_CHECK() failed with out of memory (2), exiting.
Sadly, reducing to 4 is no longer interesting, because even with --bwa-cpu-thread-pool 16 we already get the same wall-clock performance as without --fq2bamfast: it crashed after 1 hour 14 minutes, whereas a successful run without --fq2bamfast finishes in 1 hour 40 minutes.
Also, the throughput with --bwa-cpu-thread-pool 16 is about 1-1.2 Gbp/minute most of the time, a 5x to 6x drop compared to a stand-alone direct invocation of fq2bamfast --bwa-cpu-thread-pool 64 ... .
UPSHOT: --fq2bamfast is not yet usable from within pbrun germline on the RTX 4090, but it is well usable, and indeed high-performing, in stand-alone “manual” invocations as of clara-parabricks:4.3.1-1.
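So my current workaround is to run the stages by hand: stand-alone fq2bamfast first, then the variant caller on the resulting BAM (a sketch, assuming pbrun haplotypecaller with --ref/--in-bam/--out-variants, which, as I understand it, is what the germline pipeline does internally; paths are placeholders):

# step 1: alignment + sorting with the fast aligner, stand-alone
pbrun fq2bamfast \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/wgs_1.fq.gz /workdir/wgs_2.fq.gz \
    --out-bam /outputdir/wgs.bam \
    --bwa-cpu-thread-pool 64

# step 2: variant calling on the BAM produced above
pbrun haplotypecaller \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-bam /outputdir/wgs.bam \
    --out-variants /outputdir/wgs.vcf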