Can I use Clara Parabricks on DGX Spark?

I was wondering if it’s possible to run Clara Parabricks on DGX Spark. I’m considering purchasing one for personal use and would like to know if this is a supported configuration.

We are currently testing workloads on DGX Spark. We’ll have more clarity soon. Stay tuned.

Do you have any updates on the testing?

waiting for the benchmark

We have a list of currently supported workloads here: DGX Spark | Try NVIDIA NIM APIs

However, the parabricks documentation states: “Running Parabricks on a single GPU is supported but not recommended.”

Hello, I’m trying to run Parabricks on a DGX Spark, but the job fails with a runtime error (error log below). nvidia-smi shows Memory-Usage: Not Supported, and I’m wondering whether this is related to fused/unified memory, MIG, or a driver/CUDA compatibility issue.

Environment summary:
NVIDIA Driver: 580.82.09
CUDA Version: 13.0
GPU: NVIDIA GB10 (single GPU)

Command and Error:
test-1@spark-b9be:~/01.test$ docker run --rm --gpus all -v "$PWD":/work nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun fq2bam --ref /work/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --in-fq /work/reads/HG002_HiSeq30x_subsampled_R1.fastq.gz /work/reads/HG002_HiSeq30x_subsampled_R2.fastq.gz --out-bam /work/HG002.fq2bam.bam

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Auto mode: Setting --bwa-nstreams and device memory parameters automatically based on
available GPU memory. These settings are optimized for the GRCh38 reference genome and may need manual adjustment for
other references or optimal performance. For manual configuration guidance, see the fq2bam documentation.
Traceback (most recent call last):
  File "/usr/local/parabricks/run_pb.py", line 1758, in <module>
    run_pb_main()
  File "/usr/local/parabricks/run_pb.py", line 1697, in run_pb_main
    pbargs_check.pbargs_check(argsObj)
  File "/usr/local/parabricks/pbargs_check.py", line 1091, in pbargs_check
    check_fq2bam(runArgs.runArgs, True)
  File "/usr/local/parabricks/pbargs_check.py", line 122, in check_fq2bam
    memories = GetDevicesAvailableMemory()
  File "/usr/local/parabricks/pbutils.py", line 108, in GetDevicesAvailableMemory
    memMiB.append(int(line.split(",")[2]))
ValueError: invalid literal for int() with base 10: ' [N/A]'
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

Hi, please reference our FAQ on how nvidia-smi works with unified memory
https://forums.developer.nvidia.com/t/dgx-spark-gb10-faq/347344/7

EDIT: Please use nvcr.io/nvidia/clara/clara-parabricks:4.6.0-2. The suggestions below are no longer needed.

Thanks for the details and the log. This is not an issue due to unified memory, MIG, or CUDA/driver mismatch.

Unfortunately, it is a known issue that DGX Spark does not report the amount of GPU memory through nvidia-smi (see Known Issues — DGX Spark User Guide). This interferes with a new automatic configuration feature that is enabled by default in NVIDIA Parabricks v4.6.0. We will handle this case more gracefully in a future release.
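The ValueError in the traceback comes from auto-configuration code that parses nvidia-smi's CSV output and hits the literal "[N/A]" memory field that DGX Spark reports. As a rough illustration only (a hypothetical helper, not the actual Parabricks code, and the sample lines are assumptions modeled on `nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader` output), tolerant parsing could look like:

```python
def parse_free_memory_mib(csv_lines):
    """Return free memory in MiB per GPU, or None where nvidia-smi says [N/A].

    Each line is expected to look like 'name, total, free', with the free
    field either '12288 MiB' or '[N/A]' on unified-memory systems like GB10.
    """
    result = []
    for line in csv_lines:
        field = line.split(",")[2].strip()
        if field.startswith("[N/A]"):
            result.append(None)  # unified memory: not reported by nvidia-smi
        else:
            result.append(int(field.split()[0]))  # e.g. "12288 MiB" -> 12288
    return result

# Example: one discrete GPU and one GB10-style unified-memory line
lines = [
    "NVIDIA RTX 3090, 24576 MiB, 12288 MiB",
    "NVIDIA GB10, [N/A], [N/A]",
]
print(parse_free_memory_mib(lines))  # -> [12288, None]
```

A caller that gets None back would then need a fallback (for example, a user-supplied memory value) instead of crashing, which is essentially what specifying the parameters manually achieves below.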

In the meantime, you can bypass the automatic configuration by specifying explicit values for the parameters whose default is auto. I have verified that the following configurations for fq2bam, deepvariant, and haplotypecaller are valid with satisfactory performance. Depending on your specific use case, different parameters may perform better, as we did not perform an exhaustive search for the best configuration.

# fq2bam
docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun fq2bam --ref ${REFERENCE} --in-fq ${FQ1} ${FQ2} --out-bam ${outputfile} --bwa-nstreams 3 --bwa-primary-cpus 16 --bwa-cpu-thread-pool 1 --gpusort

# deepvariant (faster performance with `--use-tf32` at a slight loss of accuracy, remove for best accuracy with slower runtime)
docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun deepvariant --ref ${REFERENCE} --in-bam ${IN_BAM} --out-variants ${OUT}.vcf --num-streams-per-gpu 4 --use-tf32

# haplotypecaller
docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun haplotypecaller --ref ${REFERENCE} --in-bam ${IN_BAM} --out-variants ${OUT}.vcf --num-htvc-threads 8

How about combining 2 DGX Sparks?

Thank you for your guidance and suggestions. I applied the parameters you provided and successfully ran fq2bam and HaplotypeCaller on one sample (HG002). Below are my working environment, the commands I used, and the execution time:

Environment summary:
NVIDIA Driver: 580.82.09
CUDA Version: 13.0
GPU: NVIDIA GB10 (single GPU)

Command and Run time

# fq2bam: 56min

docker run --rm --gpus all -v "$PWD":/work nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun fq2bam --ref /work/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --in-fq /work/reads/HG002_HiSeq30x_subsampled_R1.fastq.gz /work/reads/HG002_HiSeq30x_subsampled_R2.fastq.gz --out-bam /work/HG002.fq2bam.bam --bwa-nstreams 3 --bwa-primary-cpus 16 --bwa-cpu-thread-pool 1 --gpusort



# haplotypecaller: 66min

docker run --rm --gpus all -v "$PWD":/work nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun haplotypecaller --ref /work/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --in-bam /work/HG002.fq2bam.bam --gvcf --out-variants /work/HG002.haplotypecaller.g.vcf.gz --num-htvc-threads 8

However, when I tested another sample (HG00514: ENA Browser ), the situation became more complicated.

# Reference and index: https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
bwa index GRCh38_full_analysis_set_plus_decoy_hla.fa

# command
docker run --rm --gpus all -v "$PWD":/work nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 pbrun fq2bam --ref /work/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa --in-fq /work/reads/HG00514_R1.fq.gz /work/reads/HG00514_R2.fq.gz --out-bam /work/HG00514.fq2bam.bam --bwa-nstreams 3 --bwa-primary-cpus 16 --bwa-cpu-thread-pool 1 --gpusort

During the run, I encountered several unusual messages such as "Auto-retry mode with partial batches was unsuccessful; going to CPU recovery", "Single-ended recovery mode for batch with 70320128 (both ends) reads before itself", and "Pair-ended recovery mode for batch with 85065728 (both ends) reads before itself."

During the execution of fq2bam, nearly all of the Spark's computing resources were occupied by Parabricks, including the GB10 GPU, 20 CPU threads, and 128 GB of memory.

However, the fq2bam task for HG00514 ultimately failed after exhausting all available memory.

Based on the observations above, I have the following specific questions:

  1. For the HG002 sample, which completed successfully on the Spark, Parabricks fq2bam used the GPU while also consuming 20 CPU threads, and the total runtime was nearly one hour.
    Is this the expected performance? Previously, when I tested on a single RTX 3090 GPU with the --low-memory parameter, the runtime was also close to one hour, which makes it hard to see any performance improvement from the Spark's large-memory environment.
  2. For the HG00514 sample, what caused the unusual messages, such as the recovery warnings? And since the job eventually failed due to complete memory exhaustion, what is the recommended way to resolve this?
  3. These tests seem to indicate some potential compatibility or stability issues in certain cases. Can Parabricks be reliably applied to non-human species data?

Hello,

We have published a new container with fixes for the issues you have brought up: nvcr.io/nvidia/clara/clara-parabricks:4.6.0-2.

Thank you for your question. Regarding performance, your results with HG002 are in line with what we have seen in testing. Parabricks has a minimum requirement of 16 GB of GPU memory. You may find that the performance gain on Spark versus other systems varies between Parabricks tools: fq2bam is compute-heavy with no deep learning, while other tools, such as DeepVariant, incorporate deep learning and can take advantage of the Blackwell tensor cores.

The log messages regarding recovery mode are expected. They indicate that a batch of intermediate data could not be processed on the GPU, either because it invalidated an assumption of our device code or because some other issue occurred. When this happens, the batch of reads is sent to the CPU for processing and the run proceeds. Typically this either does not happen (as you saw previously) or happens very infrequently, but it is data dependent. With every release we eliminate more scenarios that require falling back to CPU processing.

Regarding the host memory exhaustion: this was a bug in the recovery fallback and has been fixed in nvcr.io/nvidia/clara/clara-parabricks:4.6.0-2. This release also fixes the default behavior you noted and which I had confirmed in Can I use Clara Parabricks on DGX Spark? - #9 by dpuleri. You no longer need to specify those performance parameters on Spark, though feel free to tune them to find the best mix for your workload.

Yes, Parabricks can reliably be applied to non-human species. Apologies for the bug you encountered.

Parabricks supports single-node execution only, so two nodes networked together will not act as one system for a single run; each run uses only one node. For high-throughput work, you can launch an independent run on each node.
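For example, a simple way to drive two Sparks for throughput is to partition a sample list and assign one independent run per node. This is only a sketch under assumed names: the hostnames, paths, and sample list are placeholders, and the actual pbrun invocation is left commented out:

```shell
#!/bin/sh
# Hypothetical sketch: round-robin samples across two independent DGX Spark
# nodes, one Parabricks run per node. Hostnames and sample names are
# placeholders; adapt the (commented) docker command to your environment.
NODES="spark-node1 spark-node2"
SAMPLES="HG002 HG00514 NA12878 NA24385"

i=0
for sample in $SAMPLES; do
  # pick node 1 or 2 alternately (fields are space-separated in $NODES)
  node=$(echo $NODES | cut -d' ' -f$((i % 2 + 1)))
  echo "assigning $sample to $node"
  # ssh "$node" docker run --rm --gpus all -v /data:/work \
  #   nvcr.io/nvidia/clara/clara-parabricks:4.6.0-2 pbrun fq2bam \
  #   --ref /work/ref.fa --in-fq /work/${sample}_R1.fq.gz /work/${sample}_R2.fq.gz \
  #   --out-bam /work/${sample}.bam &
  i=$((i + 1))
done
```

In practice you would background each ssh invocation (as the commented `&` suggests) or use a job scheduler, so both nodes stay busy while each individual run remains single-node.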