From the documentation (CUDA Runtime API :: CUDA Toolkit Documentation):
if there is no such context, it uses a “primary context.”
It seems the runtime API will create a primary context if there is no CUDA context for the current process. So I assumed a CUDA context would be created after I successfully call any CUDA runtime API. However, from my experiment, it seems cudaGetDevice simply returns 0 without initializing a CUDA context. Is this standard behavior?
This code can explain the confusion:
#include <cuda_runtime.h>
#include <iostream>
#include <unistd.h>

int main() {
    int device = 1;
    int cur_device = -1;

    // Get the current device
    cudaError_t error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device before change: " << cur_device << std::endl;

    // Set device to 1
    error = cudaSetDevice(device);
    if (error != cudaSuccess) {
        std::cerr << "Error setting device: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }

    // Get the device again to verify
    error = cudaGetDevice(&cur_device);
    if (error != cudaSuccess) {
        std::cerr << "Error getting current device after change: " << cudaGetErrorString(error) << std::endl;
        return 1;
    }
    std::cout << "Current device after change: " << cur_device << std::endl;

    // Sleep so the process stays visible in nvidia-smi
    std::cout << "Sleeping for 60 seconds...\n";
    sleep(60);
    return 0;
}
It produces:
Current device before change: 0
Current device after change: 1
However, from nvidia-smi, I can tell the process only has a CUDA context on device 1, which makes “Current device before change: 0” very confusing, because this process never used device 0.
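For what it's worth, the same thing can be checked from inside the process via the driver API: cuDevicePrimaryCtxGetState reports whether a device's primary context is active. A minimal sketch (requires a GPU to run; the commented expectation is my reading of the nvidia-smi observation above, not verified output):

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cuInit(0);  // initializes the driver API only; creates no context

    int cur_device = -1;
    cudaGetDevice(&cur_device);  // reports 0 even if no context exists yet

    unsigned int flags = 0;
    int active = 0;
    cuDevicePrimaryCtxGetState(/*dev=*/0, &flags, &active);
    std::cout << "device 0 primary context active: " << active << std::endl;
    // If cudaGetDevice alone does not create a context, this should print 0.
    return 0;
}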
For more context, here is where I'm coming from:
I’m trying to use torch.cuda.set_device(i) to make sure my process has a CUDA context on device i. However, I find that torch.cuda.set_device(0) silently fails to create a CUDA context on device 0, while every other index works.
By checking the code, this is because torch.cuda.set_device(0) calls cudaGetDevice first and has a shortcut: if the returned device already matches the requested one, it skips the cudaSetDevice call.
I’d like to know whether this is the intended behavior of cudaGetDevice (in which case the documentation should be improved and the PyTorch team made aware of it), or a bug that needs to be fixed on the CUDA runtime API side.