[Solved] CUDA driver initialization failed - 2x RTX 5090

Hi, I have a problem when using both RTX 5090 cards.

import torch
torch.cuda.is_available()
/home/joasanna/deeplearning/lib/python3.12/site-packages/torch/cuda/__init__.py:174: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
import os
os.environ['CUDA_VISIBLE_DEVICES']="0,1"
os.environ['PYTORCH_NVML_BASED_CUDA_CHECK']='1'
torch.cuda.is_available()
True
print(torch.rand(10).cuda())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/joasanna/deeplearning/lib/python3.12/site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

But when I only use one GPU, it works:

Python 3.12.7 (main, Feb 4 2025, 14:46:03) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import os
os.environ['CUDA_VISIBLE_DEVICES']="0"
import torch
torch.cuda.is_available()
True
print(torch.rand(10).cuda())
tensor([0.8718, 0.6391, 0.2072, 0.8781, 0.8859, 0.1336, 0.8040, 0.6449, 0.6668,
0.9559], device='cuda:0')

Torch Versions:

torch 2.7.0+cu128
torchaudio 2.7.0+cu128
torchmetrics 1.6.1
torchvision 0.22.0+cu128

CUDA:

NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9

System info:

Distributor ID: Ubuntu
Description: Ubuntu 24.10
Release: 24.10
Codename: oracular

Does nvidia-smi report both GPUs as being present and operational, with no obvious anomalies on the status page? Instead of testing via PyTorch, have you created a simple CUDA application that targets each of the GPUs?

Honestly, at this point I am not convinced that the problem is with CUDA rather than PyTorch. The first thing we would want to find out is whether the two GPUs are visible to the NVML layer of the NVIDIA driver package, and secondly whether both GPUs are operational with the CUDA driver and runtime.
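
For the driver-level part of that check, a minimal probe along the following lines (a sketch, not something from this thread; it uses cuInit and cuDeviceGetCount from cuda.h and is built with nvcc probe.cu -o probe -lcuda) would tell you whether the driver itself can enumerate both GPUs, independent of the CUDA runtime and PyTorch:

// Minimal CUDA driver-API probe (sketch): checks whether the driver can be
// initialized and how many devices it enumerates, independent of the runtime.
// Build: nvcc probe.cu -o probe -lcuda
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUresult rc = cuInit(0);                     // driver-level initialization
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        printf("cuInit failed: %d (%s)\n", (int)rc, msg ? msg : "unknown");
        return 1;
    }

    int count = 0;
    rc = cuDeviceGetCount(&count);
    if (rc != CUDA_SUCCESS) {
        printf("cuDeviceGetCount failed: %d\n", (int)rc);
        return 1;
    }
    printf("Driver reports %d device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256] = {0};
        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        printf("  device %d: %s\n", i, name);
    }
    return 0;
}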

nvidia-smi looks ok:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   55C    P8             16W /  600W |      39MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   52C    P8             19W /  600W |      18MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2191      G   /usr/bin/gnome-shell                     10MiB |
|    0   N/A  N/A            2242      G   /usr/bin/Xwayland                         8MiB |
|    1   N/A  N/A            2191      G   /usr/bin/gnome-shell                      6MiB |
+-----------------------------------------------------------------------------------------+

Simple CUDA app:

#include <stdio.h>
#include <cuda_runtime.h>
#include <math.h>

// Vector addition kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount returned %d: %s\n", err, cudaGetErrorString(err));
        return 1;
    }
    printf("Detected %d CUDA capable device(s)\n", deviceCount);

    const int N = 1 << 20;            // Number of elements (1M)
    size_t size = N * sizeof(float);

    // Loop over each device
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        printf("\n=== Running on Device %d ===\n", dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device Name: %s\n", prop.name);

        // Allocate host memory
        float *h_A = (float*)malloc(size);
        float *h_B = (float*)malloc(size);
        float *h_C = (float*)malloc(size);
        for (int i = 0; i < N; ++i) {
            h_A[i] = i;
            h_B[i] = i * 2;
        }

        // Allocate device memory
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);

        // Copy data to device
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Launch kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
        cudaDeviceSynchronize();

        // Copy result back to host
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Validate results
        int errorCount = 0;
        for (int i = 0; i < N; ++i) {
            float expected = h_A[i] + h_B[i];
            if (fabs(h_C[i] - expected) > 1e-5) {
                if (errorCount < 10) {
                    printf("Mismatch at index %d: %f (got) vs %f (expected)\n", i, h_C[i], expected);
                }
                errorCount++;
            }
        }
        if (errorCount == 0) {
            printf("Result = PASS on device %d\n", dev);
        } else {
            printf("Result = FAIL on device %d (errors: %d)\n", dev, errorCount);
        }

        // Cleanup
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        free(h_A);
        free(h_B);
        free(h_C);
    }

    return 0;
}
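
One small addition worth making inside the per-device loop (a sketch, not part of the original program): check the kernel-launch and synchronization status explicitly, so a failed launch is reported with its CUDA error instead of showing up only as mismatched results later.

// Optional addition inside the per-device loop (sketch, not in the
// original program): report kernel launch / execution errors explicitly.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaError_t launchErr = cudaGetLastError();      // error from the launch itself
cudaError_t syncErr   = cudaDeviceSynchronize(); // error from kernel execution
if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    printf("Kernel error on device %d: launch=%s, sync=%s\n", dev,
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
}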

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Output:

$ nvcc simple_multi_gpu_test.cu -o test_multi_gpu
./test_multi_gpu
cudaGetDeviceCount returned 3: initialization error

Fixed it by following the last post in https://forums.developer.nvidia.com/t/blackwell-pro-failing-cuda-simplemultigpu-sample/333529/10.
The known issues section of https://docs.nvidia.com/cuda/archive/12.8.1/cuda-toolkit-release-notes/index.html#known-issues-and-limitations gives two ways to work around the 6.8 kernel problem. I solved it with this one:

Option 2: Disable HMM for UVM

Create or edit /etc/modprobe.d/uvm.conf and add or update the following line:

options nvidia_uvm uvm_disable_hmm=1

Then unload and reload the nvidia_uvm kernel module, or reboot the system:

sudo modprobe -r nvidia_uvm
sudo modprobe nvidia_uvm
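
To double-check that the option actually took effect after reloading the module, you can read the parameter back from sysfs (this assumes nvidia_uvm exposes uvm_disable_hmm at the standard module-parameter path; a plain cat of the file works just as well as the small check below):

/* Sketch: read back the nvidia_uvm module parameter to confirm HMM is
 * disabled. Assumes the standard sysfs location for module parameters. */
#include <stdio.h>

int main(void) {
    const char *path = "/sys/module/nvidia_uvm/parameters/uvm_disable_hmm";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror("could not open nvidia_uvm parameter (module not loaded?)");
        return 1;
    }
    char value[16] = {0};
    if (fgets(value, sizeof(value), f) != NULL) {
        printf("uvm_disable_hmm = %s", value);   /* expect 1 when HMM is disabled */
    }
    fclose(f);
    return 0;
}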
