Failed: CUDA driver version is insufficient for CUDA runtime version

I am struggling to resolve a CUDA version conflict between the driver and the runtime when running from Singularity. I have used two containers (built from Docker):
clara-parabricks_4.1.2-1.sif
clara-parabricks_4.2.0-1.sif
Both fail (details below) with
“[PB Error 2023-Nov-14 11:27:02][src/stitchPiece_step0.cu:438] cudaSafeCall() failed: CUDA driver version is insufficient for CUDA runtime version, exiting.”
when running the Singularity container on an HPC GPU node.

What would I need to do to resolve this conflict? (A challenge to note: users do not have root privileges on the HPC cluster.)

I don’t understand why nvidia-smi shows CUDA Version 11.8 when the container has CUDA 12.2. I also looked at the requirement variable; I suspect that fixing the paths might make it work.

Singularity> echo $NVIDIA_REQUIRE_CUDA
cuda>=12.2 brand=tesla,driver>=450,driver<451
           brand=tesla,driver>=470,driver<471
           brand=unknown,driver>=470,driver<471
           brand=nvidia,driver>=470,driver<471
           brand=nvidiartx,driver>=470,driver<471
           brand=geforce,driver>=470,driver<471
           brand=geforcertx,driver>=470,driver<471
           brand=quadro,driver>=470,driver<471
           brand=quadrortx,driver>=470,driver<471
           brand=titan,driver>=470,driver<471
           brand=titanrtx,driver>=470,driver<471
           brand=tesla,driver>=525,driver<526
           brand=unknown,driver>=525,driver<526
           brand=nvidia,driver>=525,driver<526
           brand=nvidiartx,driver>=525,driver<526
           brand=geforce,driver>=525,driver<526
           brand=geforcertx,driver>=525,driver<526
           brand=quadro,driver>=525,driver<526
           brand=quadrortx,driver>=525,driver<526
           brand=titan,driver>=525,driver<526
           brand=titanrtx,driver>=525,driver<526
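
For reference, the "CUDA Version" that nvidia-smi prints is the highest CUDA version the host driver supports, not the toolkit inside the container: 11.8 corresponds to the 520.61.05 host driver shown below, while the image ships a 12.2 runtime. A quick, hedged way to compare the two sides (the version.json path is an assumption based on recent CUDA images):

Singularity> nvidia-smi --query-gpu=driver_version --format=csv,noheader    # host driver, as seen inside the container
Singularity> cat /usr/local/cuda/version.json                               # toolkit/runtime bundled in the image (path assumed)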

Singularity> nvidia-smi
Tue Nov 14 11:39:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   On  | 00000000:86:00.0 Off |                  N/A |
| 30%   33C    P8    13W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Singularity> echo $PATH
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Singularity> uname -a
Linux gpu0035.vampire 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Singularity> ls /usr/local/cuda*
/usr/local/cuda:
compat doc gds lib64 targets

/usr/local/cuda-12:
compat doc gds lib64 targets

/usr/local/cuda-12.2:
compat doc gds lib64 targets

Singularity> time pbrun rna_fq2bam --in-fq /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz --genome-lib-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/genome_index/ --ref /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/101bp_r110/Homo_sapiens.GRCh38.110.dna_sm.primary_assembly.fa --output-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/output/ --out-bam Sample_043_END2_gpu.bam --read-files-command zcat
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz and /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz
[Parabricks Options Mesg]: @RG\tID:HYWM7BCXX161201.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HYWM7BCXX161201.1
[PB Info 2023-Nov-14 12:02:40] ------------------------------------------------------------------------------
[PB Info 2023-Nov-14 12:02:40] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Nov-14 12:02:40] || Version 4.2.0-1 ||
[PB Info 2023-Nov-14 12:02:40] || star ||
[PB Info 2023-Nov-14 12:02:40] ------------------------------------------------------------------------------
[PB Info 2023-Nov-14 12:02:40] … started STAR run
[PB Info 2023-Nov-14 12:02:40] … loading genome
[PB Info 2023-Nov-14 12:03:55] read from genomeDir done 74.864
[PB Info 2023-Nov-14 12:03:55] Gpu num:1 Cpu thread num: 4
[PB Info 2023-Nov-14 12:03:55] … started mapping
[PB Error 2023-Nov-14 12:03:55][src/stitchPiece_step0.cu:438] cudaSafeCall() failed: CUDA driver version is insufficient for CUDA runtime version, exiting.
For technical support visit Clara Parabricks v4.2.0 - NVIDIA Docs
Exiting…

Could not run rna_fq2bam
Exiting pbrun …

Best regards,
Vaibhav

What is your driver version?

Parabricks v4.2.0-1 requires driver 525.60.13 or newer.

The nvidia-smi output shows 520.61.05.

How do I update this for Singularity when I don’t have admin access?
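
One possibility, purely as a sketch: the CUDA images ship forward-compatibility libraries under /usr/local/cuda/compat (visible in the ls output above), which are designed to let a newer CUDA runtime run on an older driver without root. Be aware that forward compatibility is only supported on data-center (Tesla-brand) GPUs, so it may not help on GeForce nodes:

Singularity> export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH    # untested assumption for this cluster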

nvidia-smi -a | head -n55

==============NVSMI LOG==============

Timestamp      : Wed Nov 15 13:12:26 2023
Driver Version : 520.61.05

Your admin needs to update the host machine to driver 525 or newer.

Please see requirements here:


Is it possible to install the drivers in a user directory and point to them, or to update the container? I will reach out to the admins to see whether they would be willing to install the updates, and ask about the time frame.
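
If the admins cannot update the driver, a hedged alternative is to match the software to the driver instead: Parabricks 4.0.x is built against CUDA 11.x, which the 520.61.05 driver supports. Assuming the usual NGC tag naming, something like:

$ singularity build clara-parabricks_4.0.1-1.sif docker://nvcr.io/nvidia/clara/clara-parabricks:4.0.1-1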

Thank you!

Best regards,

Vaibhav

I noticed the driver versions differ depending on the node assigned to my job.

I tried the same task on a different assigned node and got an out-of-memory error (38 GB requested, based on the memory usage of the same task in STAR).

Tried again with 120 GB of memory and 2 GPUs and still got an out-of-memory error. What is the reason for this error?

“[PB Error 2023-Nov-16 02:50:29][src/stitchPiece_step0.cu:438] cudaSafeCall() failed: CUDA driver version is insufficient for CUDA runtime version, exiting.”

Note that nvidia-smi reports a different CUDA driver version on this node, and the nvcc -V command is not found.

Using the latest container version (singularity shell --nv clara-parabricks_4.2.0-1.sif), I am still getting the driver conflict error.

“[PB Error 2023-Nov-16 02:50:29][src/stitchPiece_step0.cu:438] cudaSafeCall() failed: CUDA driver version is insufficient for CUDA runtime version, exiting.”
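
Since the driver version varies with the assigned node, a small guard at the top of the job script can fail fast on nodes that are too old for Parabricks 4.2 (threshold taken from the reply above; this is a sketch, not tested on this cluster):

req=525.60.13
cur=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# sort -V orders version strings numerically; if the smallest is not $req, the node's driver is too old
if [ "$(printf '%s\n' "$req" "$cur" | sort -V | head -n1)" != "$req" ]; then
    echo "Driver $cur is older than the required $req" >&2
    exit 1
fi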

[janveva1@gpu0040 ~]$ cd /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/

[janveva1@gpu0040 bulk_RNAseq_gpu]$ singularity shell --nv clara-parabricks_4.0.1-1.sif

time pbrun rna_fq2bam --in-fq /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz --genome-lib-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/genome_index/ --ref /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/101bp_r110/Homo_sapiens.GRCh38.110.dna_sm.primary_assembly.fa --output-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/output/ --out-bam Sample_043_END2_gpu.bam --read-files-command zcat

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz and /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz
[PB Info 2023-Nov-16 02:25:55] ------------------------------------------------------------------------------
[PB Info 2023-Nov-16 02:25:55] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Nov-16 02:25:55] || Version 4.0.1-1 ||
[PB Info 2023-Nov-16 02:25:55] || star ||
[PB Info 2023-Nov-16 02:25:55] ------------------------------------------------------------------------------
[PB Info 2023-Nov-16 02:25:56] … started STAR run
[PB Info 2023-Nov-16 02:25:56] … loading genome
[PB Info 2023-Nov-16 02:28:00] read from genomeDir done 124.206
[PB Info 2023-Nov-16 02:28:00] Gpu num:2 Cpu thread num: 4
[PB Info 2023-Nov-16 02:28:00] … started mapping
[PB Info 2023-Nov-16 02:28:12] gpu free memory: 6445268992, total 11554848768
cudaSuccess
[PB Error 2023-Nov-16 02:28:12][src/stitchPiece_step0.cu:88] cudaSafeCall() failed: out of memory, exiting.
For technical support visit https://docs.nvidia.com/clara/parabricks/4.0.0/Help.html
Exiting…

Could not run rna_fq2bam


Singularity> nvidia-smi
Thu Nov 16 02:46:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   On  | 00000000:3B:00.0 Off |                  N/A |
| 29%   31C    P8    13W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...   On  | 00000000:5E:00.0 Off |                  N/A |
| 29%   30C    P8    14W / 250W |      1MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Singularity> nvcc -V
bash: nvcc: command not found
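
nvcc ships with the full CUDA toolkit, not with runtime-only images like this one, so "command not found" is expected here and unrelated to the error. To see which CUDA runtime the container actually bundles, listing the runtime library is one hedged check (exact soname may differ):

Singularity> ls /usr/local/cuda/lib64/libcudart.so*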


[janveva1@gpu0040 bulk_RNAseq_gpu]$ singularity shell --nv clara-parabricks_4.2.0-1.sif
Singularity> time pbrun rna_fq2bam --in-fq /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz --genome-lib-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/genome_index/ --ref /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/101bp_r110/Homo_sapiens.GRCh38.110.dna_sm.primary_assembly.fa --output-dir /nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/output/ --out-bam Sample_043_END2_gpu.bam --read-files-command zcat
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END1.fastq.gz and /panfs/accrepfs.vampire/nobackup/h_vmac/user/janveva1/Project/RNAseq/bulk_RNAseq_gpu/test_data/Sample_043_END2.fastq.gz
[Parabricks Options Mesg]: @RG\tID:HYWM7BCXX161201.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HYWM7BCXX161201.1
[PB Info 2023-Nov-16 02:49:59] ------------------------------------------------------------------------------
[PB Info 2023-Nov-16 02:49:59] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2023-Nov-16 02:49:59] || Version 4.2.0-1 ||
[PB Info 2023-Nov-16 02:49:59] || star ||
[PB Info 2023-Nov-16 02:49:59] ------------------------------------------------------------------------------
[PB Info 2023-Nov-16 02:49:59] … started STAR run
[PB Info 2023-Nov-16 02:49:59] … loading genome
[PB Info 2023-Nov-16 02:50:29] read from genomeDir done 29.094
[PB Info 2023-Nov-16 02:50:29] Gpu num:2 Cpu thread num: 4
[PB Info 2023-Nov-16 02:50:29] … started mapping
[PB Error 2023-Nov-16 02:50:29][src/stitchPiece_step0.cu:438] cudaSafeCall() failed: CUDA driver version is insufficient for CUDA runtime version, exiting.
For technical support visit Clara Parabricks v4.2.0 - NVIDIA Docs
Exiting…

Could not run rna_fq2bam
Exiting pbrun …


Hi @vaibhav.a.janve

You seem to be using a GPU with less than 16 GB of memory.

One of the requirements for all Parabricks versions is:
Any NVIDIA GPU that supports CUDA architecture 60, 70, 75, or 80 and has at least 16 GB of GPU RAM.

Please refer to the requirements here:

Best


Thank you @mdemouth! Since I requested 120 GB of memory and 2 GPUs from the cluster for this test, I assumed it would meet the 16 GB requirement.

How do I find the NVIDIA GPU RAM size?

This isn’t clear to me from the output of nvidia-smi. The relevant section I can see is:

FB Memory Usage
    Total    : 11264 MiB
    Reserved : 244 MiB
    Used     : 1 MiB
    Free     : 11018 MiB
BAR1 Memory Usage
    Total    : 256 MiB
    Used     : 15 MiB
    Free     : 241 MiB
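
For what it's worth, the FB memory total is the GPU RAM: 11264 MiB here, i.e. about 11 GB, which is below the 16 GB requirement quoted above. A more direct hedged query (the compute_cap field needs a reasonably recent nvidia-smi):

$ nvidia-smi --query-gpu=index,name,compute_cap,memory.total --format=csv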

Here’s the full output for nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Thu Nov 16 12:14:37 2023
Driver Version                      : 510.47.03
CUDA Version                        : 11.6

Attached GPUs                       : 2
GPU 00000000:3B:00.0
    Product Name                    : NVIDIA GeForce RTX 2080 Ti
    Product Brand                   : GeForce
    Product Architecture            : Turing
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    MIG Mode
        Current                     : N/A
        Pending                     : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-74874d65-8321-5043-ad54-e91b0af95df4
    Minor Number                    : 0
    VBIOS Version                   : 90.02.17.00.5F
    MultiGPU Board                  : No
    Board ID                        : 0x3b00
    GPU Part Number                 : N/A
    Module ID                       : 0
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GSP Firmware Version            : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x3B
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0410DE
        Bus Id                      : 00000000:3B:00.0
        Sub System Id               : 0x250319DA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 29 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11264 MiB
        Reserved                    : 244 MiB
        Used                        : 1 MiB
        Free                        : 11018 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 15 MiB
        Free                        : 241 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Remapped Rows                   : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        GPU Target Temperature      : 84 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 13.65 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 280.00 W
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7000 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Voltage
        Graphics                    : N/A
    Processes                       : None

GPU 00000000:5E:00.0
    Product Name                    : NVIDIA GeForce RTX 2080 Ti
    Product Brand                   : GeForce
    Product Architecture            : Turing
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    MIG Mode
        Current                     : N/A
        Pending                     : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-de63eff9-02a0-18c2-64b1-90d8dd631b69
    Minor Number                    : 1
    VBIOS Version                   : 90.02.17.00.5F
    MultiGPU Board                  : No
    Board ID                        : 0x5e00
    GPU Part Number                 : N/A
    Module ID                       : 0
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GSP Firmware Version            : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x5E
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0410DE
        Bus Id                      : 00000000:5E:00.0
        Sub System Id               : 0x250319DA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 29 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11264 MiB
        Reserved                    : 244 MiB
        Used                        : 1 MiB
        Free                        : 11018 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Remapped Rows                   : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        GPU Target Temperature      : 84 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 14.25 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7000 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Voltage
        Graphics                    : N/A
    Processes                       : None

Is there a version of Clara Parabricks that would run with 11 GB of GPU memory?