[Solved] CUDA driver initialization failed - 2x RTX 5090

Hi, I have a problem when using both RTX 5090 cards.

import torch
torch.cuda.is_available()
/home/joasanna/deeplearning/lib/python3.12/site-packages/torch/cuda/__init__.py:174: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
import os
os.environ['CUDA_VISIBLE_DEVICES']="0,1"
os.environ['PYTORCH_NVML_BASED_CUDA_CHECK']='1'
torch.cuda.is_available()
True
print(torch.rand(10).cuda())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/joasanna/deeplearning/lib/python3.12/site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

But when I only use one GPU, it works:

Python 3.12.7 (main, Feb 4 2025, 14:46:03) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import os
os.environ['CUDA_VISIBLE_DEVICES']="0"
import torch
torch.cuda.is_available()
True
print(torch.rand(10).cuda())
tensor([0.8718, 0.6391, 0.2072, 0.8781, 0.8859, 0.1336, 0.8040, 0.6449, 0.6668,
0.9559], device='cuda:0')

Torch Versions:

torch 2.7.0+cu128
torchaudio 2.7.0+cu128
torchmetrics 1.6.1
torchvision 0.22.0+cu128

CUDA:

NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9

System info:

Distributor ID: Ubuntu
Description: Ubuntu 24.10
Release: 24.10
Codename: oracular

Does nvidia-smi report both GPUs as being present and operational, with no obvious anomalies on the status page? Instead of testing via PyTorch, have you created a simple CUDA application that targets each of the GPUs?

Honestly, at this point I am not convinced that the problem is with CUDA rather than PyTorch. The first thing we would want to find out is whether the two GPUs are visible to the NVML layer of the NVIDIA driver package, and secondly whether both GPUs are operational with the CUDA driver and runtime.
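
For the driver-level part of that check, a minimal probe along the following lines (a sketch, not something from this thread; it uses cuInit and cuDeviceGetCount from cuda.h and is built with nvcc probe.cu -o probe -lcuda) would tell you whether the driver itself can enumerate both GPUs, independent of the CUDA runtime and PyTorch:

// Minimal CUDA driver-API probe (sketch): checks whether the driver can be
// initialized and how many devices it enumerates, independent of the runtime.
// Build: nvcc probe.cu -o probe -lcuda
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUresult rc = cuInit(0);                     // driver-level initialization
    if (rc != CUDA_SUCCESS) {
        const char *msg = NULL;
        cuGetErrorString(rc, &msg);
        printf("cuInit failed: %d (%s)\n", (int)rc, msg ? msg : "unknown");
        return 1;
    }

    int count = 0;
    rc = cuDeviceGetCount(&count);
    if (rc != CUDA_SUCCESS) {
        printf("cuDeviceGetCount failed: %d\n", (int)rc);
        return 1;
    }
    printf("Driver reports %d device(s)\n", count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256] = {0};
        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        printf("  device %d: %s\n", i, name);
    }
    return 0;
}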

nvidia-smi looks ok:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   55C    P8             16W /  600W |      39MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        Off |   00000000:04:00.0 Off |                  N/A |
|  0%   52C    P8             19W /  600W |      18MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2191      G   /usr/bin/gnome-shell                     10MiB |
|    0   N/A  N/A            2242      G   /usr/bin/Xwayland                         8MiB |
|    1   N/A  N/A            2191      G   /usr/bin/gnome-shell                      6MiB |
+-----------------------------------------------------------------------------------------+

Simple CUDA app:

#include <stdio.h>
#include <cuda_runtime.h>
#include <math.h>

// Vector addition kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount returned %d: %s\n", err, cudaGetErrorString(err));
        return 1;
    }
    printf("Detected %d CUDA capable device(s)\n", deviceCount);

    const int N = 1 << 20;            // Number of elements (1M)
    size_t size = N * sizeof(float);

    // Loop over each device
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        printf("\n=== Running on Device %d ===\n", dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device Name: %s\n", prop.name);

        // Allocate host memory
        float *h_A = (float*)malloc(size);
        float *h_B = (float*)malloc(size);
        float *h_C = (float*)malloc(size);
        for (int i = 0; i < N; ++i) {
            h_A[i] = i;
            h_B[i] = i * 2;
        }

        // Allocate device memory
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);

        // Copy data to device
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Launch kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
        cudaDeviceSynchronize();

        // Copy result back to host
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Validate results
        int errorCount = 0;
        for (int i = 0; i < N; ++i) {
            float expected = h_A[i] + h_B[i];
            if (fabs(h_C[i] - expected) > 1e-5) {
                if (errorCount < 10) {
                    printf("Mismatch at index %d: %f (got) vs %f (expected)\n", i, h_C[i], expected);
                }
                errorCount++;
            }
        }
        if (errorCount == 0) {
            printf("Result = PASS on device %d\n", dev);
        } else {
            printf("Result = FAIL on device %d (errors: %d)\n", dev, errorCount);
        }

        // Cleanup
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        free(h_A);
        free(h_B);
        free(h_C);
    }

    return 0;
}
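
One small addition worth making inside the per-device loop (a sketch, not part of the original program): check the kernel-launch and synchronization status explicitly, so a failed launch is reported with its CUDA error instead of showing up only as mismatched results later.

// Optional addition inside the per-device loop (sketch, not in the
// original program): report kernel launch / execution errors explicitly.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaError_t launchErr = cudaGetLastError();      // error from the launch itself
cudaError_t syncErr   = cudaDeviceSynchronize(); // error from kernel execution
if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    printf("Kernel error on device %d: launch=%s, sync=%s\n", dev,
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
}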

nvcc:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Output:

$ nvcc simple_multi_gpu_test.cu -o test_multi_gpu
./test_multi_gpu
cudaGetDeviceCount returned 3: initialization error

Fixed it by following the last post in https://forums.developer.nvidia.com/t/blackwell-pro-failing-cuda-simplemultigpu-sample/333529/10.
The known issues section of https://docs.nvidia.com/cuda/archive/12.8.1/cuda-toolkit-release-notes/index.html#known-issues-and-limitations gives two ways to work around the 6.8 kernel problem. I solved it with this one:

Option 2: Disable HMM for UVM

Create or edit /etc/modprobe.d/uvm.conf and add or update the following line:

options nvidia_uvm uvm_disable_hmm=1

Then unload and reload the nvidia_uvm kernel module, or reboot the system:

sudo modprobe -r nvidia_uvm
sudo modprobe nvidia_uvm
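
To double-check that the option actually took effect after reloading the module, you can read the parameter back from sysfs (this assumes nvidia_uvm exposes uvm_disable_hmm at the standard module-parameter path; a plain cat of the file works just as well as the small check below):

/* Sketch: read back the nvidia_uvm module parameter to confirm HMM is
 * disabled. Assumes the standard sysfs location for module parameters. */
#include <stdio.h>

int main(void) {
    const char *path = "/sys/module/nvidia_uvm/parameters/uvm_disable_hmm";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror("could not open nvidia_uvm parameter (module not loaded?)");
        return 1;
    }
    char value[16] = {0};
    if (fgets(value, sizeof(value), f) != NULL) {
        printf("uvm_disable_hmm = %s", value);   /* expect 1 when HMM is disabled */
    }
    fclose(f);
    return 0;
}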
