What's the expected behavior of calling cudaGetDevice when the process has no CUDA context?

From the documentation (CUDA Runtime API :: CUDA Toolkit Documentation):

if there is no such context, it uses a “primary context.”

It seems the runtime API will create a primary context if there is no CUDA context for the current process. So I assume a CUDA context will be created after I call any CUDA runtime API successfully.

However, from my experiment, it seems cudaGetDevice just returns 0 without initializing a CUDA context. Is this standard behavior?

The following code demonstrates the confusion:

#include <cuda_runtime.h>
#include <iostream>
#include <unistd.h>

int main() {
    int device = 1;
    int cur_device = -1;
    
    // Get current device
    cudaError_t error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device before change: " << cur_device << std::endl;
    
    // Set device to 1
    error = cudaSetDevice(device);
    if (error != cudaSuccess) {
        std::cerr << "Error setting device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    
    // Get device again to verify
    error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device after change: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device after change: " << cur_device << std::endl;
    
    // Sleep for 60 seconds so the process's contexts can be inspected with nvidia-smi
    std::cout << "Sleeping for 60 seconds...\n";
    sleep(60);
    
    return 0;
}

It produces:

Current device before change: 0
Current device after change: 1

However, from nvidia-smi I can tell the process has a CUDA context only on device 1, which makes “Current device before change: 0” very confusing, because this process never used device 0.
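
As an aside, an alternative to checking nvidia-smi is to query the primary-context state directly through the driver API with cuDevicePrimaryCtxGetState, which itself does not create a context. Below is a minimal sketch under that assumption (link against the driver library, e.g. -lcuda, in addition to the runtime; error checking omitted):

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    // Initialize the driver API only; cuInit does not create a context.
    cuInit(0);

    int cur = -1;
    cudaGetDevice(&cur);    // reports 0

    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxGetState(dev, &flags, &active);

    // Expected here: active == 0, i.e. cudaGetDevice did not create a context.
    std::cout << "device 0 primary context active: " << active << std::endl;
    return 0;
}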

For more context, here is where I am coming from:

I’m trying to use torch.cuda.set_device(i) to make sure my process has a CUDA context on device i. However, I find that torch.cuda.set_device(0) silently fails to create a CUDA context on device 0, while other indices work.

By checking the code:

This is because torch.cuda.set_device(0) calls cudaGetDevice first, and it has a shortcut: if the returned device is already the same as the requested one, it skips the cudaSetDevice call.
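
Roughly, the shortcut corresponds to something like this (a hypothetical C++ sketch of the logic described above, not the actual PyTorch source):

#include <cuda_runtime.h>

// Hypothetical sketch of the set_device shortcut: if cudaGetDevice already
// reports the requested device, cudaSetDevice is skipped entirely, so for
// device 0 no context is ever created.
void set_device_with_shortcut(int requested) {
    int current = -1;
    cudaGetDevice(&current);      // may report 0 without any context existing
    if (current == requested) {
        return;                   // shortcut taken: no cudaSetDevice call
    }
    cudaSetDevice(requested);     // creates the primary context (CUDA 12+)
}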

I’d like to know whether this is the intended behavior of cudaGetDevice (in which case the documentation should be improved and the PyTorch team should be made aware of it), or whether it is a bug that needs to be fixed on the CUDA runtime API side.

Cross-posting the PyTorch issue: torch.cuda.set_device(0) behaves differently from torch.cuda.set_device(1) in terms of cuda context · Issue #155668 · pytorch/pytorch · GitHub

I would say it is expected that cudaGetDevice returns 0. The programming guide states the following:

6.2.9.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

That is also what you see in ordinary single-GPU code, which does not require an explicit cudaSetDevice to select device 0.
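
For example, a minimal sketch of such single-GPU code; the first runtime call that actually needs a context (here, cudaMalloc) initializes device 0’s primary context without any cudaSetDevice:

#include <cuda_runtime.h>

int main() {
    void* p = nullptr;
    // No cudaSetDevice anywhere: the allocation happens on device 0, the
    // default current device, and this call initializes its primary context.
    cudaMalloc(&p, 1 << 20);
    cudaFree(p);
    return 0;
}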

Regarding your use case with PyTorch, you could call set_device(0) last, which should then initialize device 0, since the current device id would have changed by that point and the shortcut would not apply.


What is unexpected is that cudaGetDevice will not initialize the primary CUDA context.

Since CUDA 12, cudaSetDevice creates a context. I cannot find any mention that the same should be the case for cudaGetDevice. The documentation you linked in the first post does not state that every API call will initialize the context.
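
One way to see this, as a sketch only (assuming a CUDA 12+ toolkit and linking against the driver library, e.g. -lcuda; error checking omitted), is to query the primary-context state around the cudaSetDevice call:

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cuInit(0);                            // driver API init; creates no context

    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;
    cuDeviceGet(&dev, 0);

    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::cout << "before cudaSetDevice(0): active=" << active << std::endl;

    cudaSetDevice(0);                     // CUDA 12+: initializes the primary context

    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::cout << "after cudaSetDevice(0): active=" << active << std::endl;
    return 0;
}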

Quoting from your linked documentation:

Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a “primary context.” Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them.

If no context is needed, it won’t be created.


Well, fair enough; then this is just an undocumented behavior detail. I hope future documentation can be improved to clearly state that cudaGetDevice returns 0 but does not initialize the context.

You can file a bug for documentation concerns.