GPUs + SLURM + NVIDIA Parabricks: sometimes I get "cudaGetDevice() failed in geting device ID. Status: unknown error, exiting"

Hi all,

I’m running Parabricks fq2bam 4.5.0-1 inside an Apptainer container, driven by Nextflow on a SLURM cluster.
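
Stripped of the Nextflow wrapping, each job boils down to roughly the following (the image name, paths and resource values are placeholders here, not my exact script):

#!/bin/bash
#SBATCH --job-name=fq2bam
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=48

# Run Parabricks fq2bam inside the Apptainer image; --nv exposes the host
# NVIDIA driver and GPU devices to the container.
apptainer exec --nv clara-parabricks_4.5.0-1.sif \
    pbrun fq2bam \
        --ref /path/to/reference.fasta \
        --in-fq sample_R1.fastq.gz sample_R2.fastq.gz \
        --out-bam sample.bam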

Sometimes, fq2bam runs without any problem (exit status=0):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(...)
[PB Info 2025-Sep-10 15:26:49] ------------------------------------------------------------------------------     
[PB Info 2025-Sep-10 15:26:49] ||                 Parabricks accelerated Genomics Pipeline                 ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ||                              Version 4.5.0-1                             ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ------------------------------------------------------------------------------                                                                 
[PB Info 2025-Sep-10 15:26:49] Mode = pair-ended-gpu                                                                                                                          
[PB Info 2025-Sep-10 15:26:49] Running with 2 GPU(s), using 4 stream(s) per device with 24 worker threads per GPU                                                             
(...)                                                  
Total Time:                             5 minutes 1 second        ||            
(exit status = 0)                                       

But sometimes I get an error:

  | NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |                                                                                 
  |-----------------------------------------+------------------------+----------------------+                                                                                 
  | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |                                                                                 
  | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |                                                                                 
  |                                         |                        |               MIG M. |                                                                                 
  |=========================================+========================+======================|                                                                                 
  |   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |                                                                                 
  | N/A   32C    P0             39W /  250W |       0MiB /  40960MiB |      0%      Default |                                                                                 
  |                                         |                        |             Disabled |                                                                                 
  +-----------------------------------------+------------------------+----------------------+                                                                                 
  |   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |                                                                                 
  | N/A   29C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |                                                                                 
  |                                         |                        |             Disabled |                                                                                 
  +-----------------------------------------+------------------------+----------------------+                                                                                 
                                                                                                                                                                              
  +-----------------------------------------------------------------------------------------+                                                                                 
  | Processes:                                                                              |                                                                                 
  |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |                                                                                 
  |        ID   ID                                                               Usage      |                                                                                 
  |=========================================================================================|                                                                                 
  |  No running processes found                                                             |
(...)
Mode = pair-ended-gpu                                                                                                                        
Running with 2 GPU(s), using 1 stream(s) per device with 24 worker threads per GPU                                                           
cudaGetDevice() failed in geting device ID. Status: unknown error, exiting. 
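
Since nvidia-smi looks identical in both cases, to narrow this down I am thinking of capturing a bit more state around the pbrun call so I can compare a good run with a failing one (just a sketch; the image name is a placeholder):

# Logged right before the apptainer/pbrun invocation in the SLURM job script.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# GPU device nodes on the host...
ls -l /dev/nvidia* 2>&1
# ...and as seen from inside the container (sh -c so the glob expands inside).
apptainer exec --nv clara-parabricks_4.5.0-1.sif sh -c 'ls -l /dev/nvidia*' 2>&1
apptainer exec --nv clara-parabricks_4.5.0-1.sif nvidia-smi -L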

How can I fix this, please?