Clara-parabricks_4.1.0-1.sif cannot recognize A100 cards?

Hi, I am trying to run fq2bam on an A100 node and getting this error message:

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /home/crick/working/haplotypecalling/Data/sample_1.fq.gz and
/home/crick/working/haplotypecalling/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2023-Jun-01 00:50:13] ------------------------------------------------------------------------------
[PB Info 2023-Jun-01 00:50:13] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Jun-01 00:50:13] || Version 4.1.0-1 ||
[PB Info 2023-Jun-01 00:50:13] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2023-Jun-01 00:50:13] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Error 2023-Jun-01 00:50:14][ParaBricks/src/pbOpts.cu:107] Bad argument value: Number of GPUs requested (4) is more than number of GPUs (0) in the system, exiting.
For technical support visit Help - NVIDIA Docs
Exiting…

Could not run fq2bam
Exiting pbrun …

===
On the host:

[crick@csctmp-xe8545-2 haplotypecalling]$ nvidia-smi
Thu Jun 1 01:29:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:01:00.0 Off | 0 |
| N/A 20C P0 57W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:41:00.0 Off | 0 |
| N/A 19C P0 56W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:81:00.0 Off | 0 |
| N/A 21C P0 58W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | 0 |
| N/A 18C P0 59W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

In the container:

[crick@csctmp-xe8545-2 haplotypecalling]$ singularity shell clara-parabricks_4.1.0-1.sif
Singularity> nvidia-smi
Thu Jun 1 01:30:58 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:01:00.0 Off | 0 |
| N/A 20C P0 57W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:41:00.0 Off | 0 |
| N/A 19C P0 56W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:81:00.0 Off | 0 |
| N/A 21C P0 58W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | 0 |
| N/A 18C P0 59W / 500W| 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Singularity>

Any clue what is going on?

Thanks,

Wei


By the way, I have --nv and --num-gpus in my command, and the same command works with V100 cards. Also, in singularity.conf I set "always use nv = yes" and "always use rocm = yes".
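
Roughly, the invocation looks like the sketch below (the reference path here is a placeholder, not my exact command):

# Check that the container can enumerate the GPUs before running pbrun
singularity exec --nv clara-parabricks_4.1.0-1.sif nvidia-smi -L

# Then run fq2bam from the same container, requesting all four A100s
singularity exec --nv clara-parabricks_4.1.0-1.sif pbrun fq2bam \
    --ref /path/to/reference.fa \
    --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
    --out-bam sample.bam \
    --num-gpus 4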

Hi,
I observed the exact same error on 4xA100 cards, running on Docker.

It works now for me.
I re-installed CUDA 12 correctly and disabled the MIG service.

Thanks,

Parabricks does not support MIG mode.
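
If you are not sure whether MIG is active, a quick check looks like this (disabling MIG needs admin rights and resets the GPU, so coordinate with your system administrators):

# Show the current MIG mode for every GPU; Parabricks needs this to say Disabled
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv

# Disable MIG on GPU 0 (admin only; repeat for each GPU)
sudo nvidia-smi -i 0 -mig 0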

Hi, have you resolved this issue yet? I encountered the same problem after upgrading CUDA and the driver. Additionally, I am just a regular user on our HPC, so I cannot reinstall CUDA.

MIG has been disabled on my device, but I still hit the same issue. I have a very urgent task that needs to be completed. Can someone please help me?

Can you share more details? Please share the full command you ran and the full output.

What driver/CUDA version was working previously? What version of Parabricks?

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:0E:00.0 Off | 0 |
| N/A 26C P0 40W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:13:00.0 Off | 0 |
| N/A 26C P0 41W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:49:00.0 Off | 0 |
| N/A 24C P0 42W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:4F:00.0 Off | 0 |
| N/A 27C P0 40W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB On | 00000000:94:00.0 Off | 0 |
| N/A 27C P0 43W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB On | 00000000:9A:00.0 Off | 0 |
| N/A 24C P0 42W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB On | 00000000:CC:00.0 Off | 0 |
| N/A 25C P0 41W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB On | 00000000:D1:00.0 Off | 0 |
| N/A 25C P0 43W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
My command:
pbrun fq2bam_meth --ref hg38.fa --in-fq ../fq2bam_test/SRR13749850_1.fastq.gz ../fq2bam_test/SRR13749850_1.fastq.gz --out-bam SRR13749850.bam --num-gpus 1
My docker command: docker run -it --gpus='"device=0,1"' -e NVIDIA_VISIBLE_DEVICES=0,1 --volume /home/wang_yanni/sc_meth_ATAC/integration/ATAC_Meth/nature_data_atac_meth/fq2bam:/mydata 74f2b983a773 /bin/bash

The error:
root@5b45d44a0a2e:/mydata/fq2bam_meth# pbrun fq2bam_meth --ref hg38.fa --in-fq ../fq2bam_test/SRR13749850_1.fastq.gz ../fq2bam_test/SRR13749850_1.fastq.gz --out-bam SRR13749850.bam
Please visit NVIDIA Clara - NVIDIA Docs for detailed documentation

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /mydata/fq2bam_test/SRR13749850_1.fastq.gz and
/mydata/fq2bam_test/SRR13749850_1.fastq.gz
[Parabricks Options Mesg]: @RG\tID:SRR13749850.1.1\tLB:lib1\tPL:bar\tSM:sample\tPU:SRR13749850.1.1
[PB Info 2024-Jul-02 14:26:45] ------------------------------------------------------------------------------
[PB Info 2024-Jul-02 14:26:45] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2024-Jul-02 14:26:45] || Version 4.3.1-1 ||
[PB Info 2024-Jul-02 14:26:45] || GPU-PBBWA mem, Sorting Phase-I ||
[PB Info 2024-Jul-02 14:26:45] ------------------------------------------------------------------------------
[PB Info 2024-Jul-02 14:26:45] Mode = pair-ended-gpu
[PB Info 2024-Jul-02 14:26:45] Running with 8 GPU(s), using 4 stream(s) per device with 16 worker threads per GPU
[PB Info 2024-Jul-02 14:26:55] # 0 0 0 0 0 0 pool: 0 0 bases/GPU/minute: 0.0
[PB Info 2024-Jul-02 14:27:04] Time spent reading: 0.008965 seconds
[PB Error 2024-Jul-02 14:27:05][src/internal/bwa_lib_context.cu:86] cudaGetDevice() failed in geting device ID. Status: system not yet initialized, exiting.
For technical support visit NVIDIA Clara - NVIDIA Docs
Exiting…

-e NVIDIA_VISIBLE_DEVICES=0,1 - this docker flag is wrong. It should be:

CUDA_VISIBLE_DEVICES

You can remove this part from the command though. We suggest just using this command:

docker run -it --gpus '"device=0,1"' --volume /home/wang_yanni/sc_meth_ATAC/integration/ATAC_Meth/nature_data_atac_meth/fq2bam:/mydata 74f2b983a773 /bin/bash

You should also confirm things look correct in the container by running nvidia-smi once inside.
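
For example, something like this inside the container should show exactly the two requested A100s (a suggested sanity check, not part of the original command):

# Run inside the container started with --gpus '"device=0,1"'
nvidia-smi -L                  # expect two A100 entries
echo $NVIDIA_VISIBLE_DEVICES   # may be set by the NVIDIA container runtime
echo $CUDA_VISIBLE_DEVICES     # should normally be empty here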


I ran nvidia-smi in the Docker container, and it showed:
root@c3a39e4fa403:/code# nvidia-smi
Tue Jul 2 17:12:59 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:0E:00.0 Off | 0 |
| N/A 27C P0 40W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB Off | 00000000:13:00.0 Off | 0 |
| N/A 27C P0 42W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Then I ran pbrun fq2bam --ref /mydata/hg38.fa --in-fq /mydata/fq2bam_test/SRR13749850_1.fastq.gz /mydata/fq2bam_test/SRR13749850_2.fastq.gz --out-bam /mydata/SRR13749850_new.bam --no-markdups --num-gpus 1 and got the same error:
root@c3a39e4fa403:/mydata# pbrun fq2bam --ref /mydata/hg38.fa --in-fq /mydata/fq2bam_test/SRR13749850_1.fastq.gz /mydata/fq2bam_test/SRR13749850_2.fastq.gz --out-bam /mydata/SRR13749850_new.bam --no-markdups --num-gpus 1
Please visit NVIDIA Clara - NVIDIA Docs for detailed documentation

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /mydata/fq2bam_test/SRR13749850_1.fastq.gz and
/mydata/fq2bam_test/SRR13749850_2.fastq.gz
[Parabricks Options Mesg]: @RG\tID:SRR13749850.1.1\tLB:lib1\tPL:bar\tSM:sample\tPU:SRR13749850.1.1
[PB Info 2024-Jul-02 17:14:46] ------------------------------------------------------------------------------
[PB Info 2024-Jul-02 17:14:46] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2024-Jul-02 17:14:46] || Version 4.0.0-1 ||
[PB Info 2024-Jul-02 17:14:46] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2024-Jul-02 17:14:46] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Error 2024-Jul-02 17:15:06][ParaBricks/src/pbOpts.cu:132] Bad argument value: Number of GPUs requested (1) is more than number of GPUs (0) in the system, exiting.
For technical support visit Help - NVIDIA Docs
Exiting…

Could not run fq2bam
Exiting pbrun …

Once inside the container, what does this print?

echo $CUDA_VISIBLE_DEVICES
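
If that variable turns out to be set to an empty string or to device indices that do not exist inside the container, the CUDA runtime will report 0 GPUs, which would match the error above. A possible follow-up to try (a suggestion, not a confirmed fix, reusing the command from this thread):

# Clear a bad GPU-selection variable and retry
unset CUDA_VISIBLE_DEVICES
pbrun fq2bam --ref /mydata/hg38.fa \
    --in-fq /mydata/fq2bam_test/SRR13749850_1.fastq.gz /mydata/fq2bam_test/SRR13749850_2.fastq.gz \
    --out-bam /mydata/SRR13749850_new.bam --no-markdups --num-gpus 1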