What's the expected behavior of calling cudaGetDevice when the process has no CUDA context?

From the documentation (CUDA Runtime API :: CUDA Toolkit Documentation):

if there is no such context, it uses a “primary context.”

It seems the runtime API will create a primary context if there is no CUDA context for the current process. So I assume a CUDA context will be created after I call any CUDA runtime API successfully.

However, from my experiment, it seems cudaGetDevice just returns 0 without initializing a CUDA context. Is this standard behavior?

The following code demonstrates the confusion:

#include <cuda_runtime.h>
#include <iostream>
#include <unistd.h>

int main() {
    int device = 1;
    int cur_device = -1;
    
    // Get current device
    cudaError_t error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device before change: " << cur_device << std::endl;
    
    // Set device to 1
    error = cudaSetDevice(device);
    if (error != cudaSuccess) {
        std::cerr << "Error setting device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    
    // Get device again to verify
    error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device after change: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device after change: " << cur_device << std::endl;
    
    // Sleep for 60 seconds so the process's contexts can be inspected with nvidia-smi
    std::cout << "Sleeping for 60 seconds...\n";
    sleep(60);
    
    return 0;
}

It produces:

Current device before change: 0
Current device after change: 1

However, from nvidia-smi I can tell the process has a CUDA context only on device 1, which makes “Current device before change: 0” very confusing, because this process never used device 0.
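
As an aside, an alternative to checking nvidia-smi is to query the primary-context state directly through the driver API with cuDevicePrimaryCtxGetState, which itself does not create a context. Below is a minimal sketch under that assumption (link against the driver library, e.g. -lcuda, in addition to the runtime; error checking omitted):

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    // Initialize the driver API only; cuInit does not create a context.
    cuInit(0);

    int cur = -1;
    cudaGetDevice(&cur);    // reports 0

    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxGetState(dev, &flags, &active);

    // Expected here: active == 0, i.e. cudaGetDevice did not create a context.
    std::cout << "device 0 primary context active: " << active << std::endl;
    return 0;
}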

For more context, here is where I am coming from:

I’m trying to use torch.cuda.set_device(i) to make sure my process has a CUDA context on device i. However, I find that torch.cuda.set_device(0) silently fails to create a CUDA context on device 0, while other indices work.

By checking the code:

This is because torch.cuda.set_device(0) calls cudaGetDevice first, and it has a shortcut: if the returned device is already the same as the requested one, it skips the cudaSetDevice call.
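
Roughly, the shortcut corresponds to something like this (a hypothetical C++ sketch of the logic described above, not the actual PyTorch source):

#include <cuda_runtime.h>

// Hypothetical sketch of the set_device shortcut: if cudaGetDevice already
// reports the requested device, cudaSetDevice is skipped entirely, so for
// device 0 no context is ever created.
void set_device_with_shortcut(int requested) {
    int current = -1;
    cudaGetDevice(&current);      // may report 0 without any context existing
    if (current == requested) {
        return;                   // shortcut taken: no cudaSetDevice call
    }
    cudaSetDevice(requested);     // creates the primary context (CUDA 12+)
}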

I’d like to know whether this is the intended behavior of cudaGetDevice (in which case the documentation should be improved and the PyTorch team should be made aware of it), or whether it is a bug that needs to be fixed on the CUDA runtime API side.

Cross-posting the PyTorch issue: torch.cuda.set_device(0) behaves differently from torch.cuda.set_device(1) in terms of cuda context · Issue #155668 · pytorch/pytorch · GitHub

I would say it is expected that cudaGetDevice returns 0. The programming guide states the following:

6.2.9.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

That is also what you see in ordinary single-GPU code, which does not require an explicit cudaSetDevice to select device 0.
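
For example, a minimal sketch of such single-GPU code; the first runtime call that actually needs a context (here, cudaMalloc) initializes device 0’s primary context without any cudaSetDevice:

#include <cuda_runtime.h>

int main() {
    void* p = nullptr;
    // No cudaSetDevice anywhere: the allocation happens on device 0, the
    // default current device, and this call initializes its primary context.
    cudaMalloc(&p, 1 << 20);
    cudaFree(p);
    return 0;
}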

Regarding your use case with PyTorch, you could call set_device(0) last, which should then initialize device 0, since the current device id would have changed by that point and the shortcut would not apply.


What is unexpected is that cudaGetDevice will not initialize the primary CUDA context.

Since CUDA 12, cudaSetDevice creates a context. I cannot find any mention that the same should be the case for cudaGetDevice. The documentation you linked in the first post does not state that every API call will initialize the context.
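
One way to see this, as a sketch only (assuming a CUDA 12+ toolkit and linking against the driver library, e.g. -lcuda; error checking omitted), is to query the primary-context state around the cudaSetDevice call:

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cuInit(0);                            // driver API init; creates no context

    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;
    cuDeviceGet(&dev, 0);

    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::cout << "before cudaSetDevice(0): active=" << active << std::endl;

    cudaSetDevice(0);                     // CUDA 12+: initializes the primary context

    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::cout << "after cudaSetDevice(0): active=" << active << std::endl;
    return 0;
}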

Quoting from your linked documentation:

Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a “primary context.” Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them.

If no context is needed, it won’t be created.


Well, fair enough; then this is just an undocumented behavior detail. I hope future documentation can be improved to clearly state that cudaGetDevice returns 0 but does not initialize the context.

You can file a bug for documentation concerns.