GPUs + SLURM + NVIDIA Parabricks: sometimes I get "cudaGetDevice() failed in geting device ID. Status: unknown error, exiting"

Hi all,

I’m running Parabricks fq2bam 4.5.0-1 inside an Apptainer container, driven by Nextflow on a SLURM cluster.
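
Stripped of the Nextflow wrapping, each job boils down to roughly the following (the image name, paths and resource values are placeholders here, not my exact script):

#!/bin/bash
#SBATCH --job-name=fq2bam
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=48

# Run Parabricks fq2bam inside the Apptainer image; --nv exposes the host
# NVIDIA driver and GPU devices to the container.
apptainer exec --nv clara-parabricks_4.5.0-1.sif \
    pbrun fq2bam \
        --ref /path/to/reference.fasta \
        --in-fq sample_R1.fastq.gz sample_R2.fastq.gz \
        --out-bam sample.bam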

Sometimes, fq2bam runs without any problem (exit status=0):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(...)
[PB Info 2025-Sep-10 15:26:49] ------------------------------------------------------------------------------     
[PB Info 2025-Sep-10 15:26:49] ||                 Parabricks accelerated Genomics Pipeline                 ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ||                              Version 4.5.0-1                             ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||                                                                 
[PB Info 2025-Sep-10 15:26:49] ------------------------------------------------------------------------------                                                                 
[PB Info 2025-Sep-10 15:26:49] Mode = pair-ended-gpu                                                                                                                          
[PB Info 2025-Sep-10 15:26:49] Running with 2 GPU(s), using 4 stream(s) per device with 24 worker threads per GPU                                                             
(...)                                                  
Total Time:                             5 minutes 1 second        ||            
(exit status = 0)                                       

But sometimes I get an error:

  | NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |                                                                                 
  |-----------------------------------------+------------------------+----------------------+                                                                                 
  | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |                                                                                 
  | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |                                                                                 
  |                                         |                        |               MIG M. |                                                                                 
  |=========================================+========================+======================|                                                                                 
  |   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |                                                                                 
  | N/A   32C    P0             39W /  250W |       0MiB /  40960MiB |      0%      Default |                                                                                 
  |                                         |                        |             Disabled |                                                                                 
  +-----------------------------------------+------------------------+----------------------+                                                                                 
  |   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |                                                                                 
  | N/A   29C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |                                                                                 
  |                                         |                        |             Disabled |                                                                                 
  +-----------------------------------------+------------------------+----------------------+                                                                                 
                                                                                                                                                                              
  +-----------------------------------------------------------------------------------------+                                                                                 
  | Processes:                                                                              |                                                                                 
  |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |                                                                                 
  |        ID   ID                                                               Usage      |                                                                                 
  |=========================================================================================|                                                                                 
  |  No running processes found                                                             |
(...)
Mode = pair-ended-gpu                                                                                                                        
Running with 2 GPU(s), using 1 stream(s) per device with 24 worker threads per GPU                                                           
cudaGetDevice() failed in geting device ID. Status: unknown error, exiting. 
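
Since nvidia-smi looks identical in both cases, to narrow this down I am thinking of capturing a bit more state around the pbrun call so I can compare a good run with a failing one (just a sketch; the image name is a placeholder):

# Logged right before the apptainer/pbrun invocation in the SLURM job script.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# GPU device nodes on the host...
ls -l /dev/nvidia* 2>&1
# ...and as seen from inside the container (sh -c so the glob expands inside).
apptainer exec --nv clara-parabricks_4.5.0-1.sif sh -c 'ls -l /dev/nvidia*' 2>&1
apptainer exec --nv clara-parabricks_4.5.0-1.sif nvidia-smi -L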

How can I fix this, please?