I inserted a P100 into my server, and now I can’t run my CUDA samples anymore. For example, deviceQuery dies with:
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 999
-> unknown error
Result = FAIL
I deleted the file and recompiled. The compilation goes through, but when I run the program I get the same error. Other samples fail differently; for example, BlackScholes complains with:
CUDA error at …/…/common/inc/helper_cuda.h:779 code=999(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
Same thing here: I delete the file and recompile, the compilation succeeds, but running it again gives the same error.
So inserting another GPU seems to have broken the device count query (cudaGetDeviceCount).
I have three generations of cards in this machine, but in a previous thread njuffa confirmed to me that this won’t matter. I have:
device 0: GTX 980
devices 1 through 4: K80
device 5: P100 (the 16 GB PCIe version)
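One way to narrow down which board triggers error 999 might be to run a failing sample once per GPU, exposing only that GPU to CUDA through the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch, assuming the sample exits with a non-zero status on failure and that the indices in the list above follow PCI bus order (the order nvidia-smi uses). Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the CUDA runtime number devices the same way. The helpers `isolated_env` and `probe` are hypothetical names used only for illustration.

```python
import os
import subprocess

# Device indices from the list above (0 = GTX 980, 1-4 = K80, 5 = P100),
# assumed to be in PCI bus order.
DEVICE_INDICES = range(6)


def isolated_env(index, base=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA number devices in PCI bus
    order (as nvidia-smi does) instead of its default fastest-first order.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env


def probe(sample, indices=DEVICE_INDICES):
    """Run the sample once per GPU index and record its exit status."""
    results = {}
    for index in indices:
        proc = subprocess.run([sample], env=isolated_env(index),
                              capture_output=True, text=True)
        results[index] = proc.returncode
    return results


def report(results):
    """Print one line per index: "ok" for exit status 0, otherwise the code."""
    for index, code in results.items():
        status = "ok" if code == 0 else f"FAIL (exit {code})"
        print(f"CUDA_VISIBLE_DEVICES={index}: {status}")
```

For example, `report(probe("./deviceQuery"))` prints one line per index. If only index 5 fails, the P100 (or its slot) would be the likely suspect; if every index fails, the problem may be system-wide, for example in the driver installation.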
I use the 470.57.02 driver, which is the latest available that supports Tesla cards. It also seems to be the last one that supports Kepler: with newer drivers I apparently can’t use my K80s anymore. Can anyone confirm?
CUDA 11.4.2 on Fedora 34.
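To check that every board, including the new P100, is actually bound to the 470.57.02 driver, and to see the PCI bus order nvidia-smi assigns, its query interface can be parsed. This is a minimal sketch, assuming `nvidia-smi` is on `PATH`; the field names come from `nvidia-smi --help-query-gpu`, while `parse_gpu_csv` and `query_gpus` are hypothetical helper names.

```python
import csv
import io
import subprocess

# Fields accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = ["index", "pci.bus_id", "name", "driver_version"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    # nvidia-smi separates values with ", ", so skip the space after each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def query_gpus():
    """Ask nvidia-smi for each GPU's index, PCI bus ID, name, and driver version."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def report(gpus):
    """Print one line per GPU."""
    for gpu in gpus:
        print(f"{gpu['index']}: {gpu['name']} @ {gpu['pci.bus_id']} "
              f"(driver {gpu['driver_version']})")
```

Running `report(query_gpus())` should list six entries here. Any board missing from the list or reporting a different driver version would be worth investigating first.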
The P100 shows up just fine in nvtop, nvidia-smi, and nvidia-settings. In fact, nvidia-settings now also lists an additional, disabled virtual device called "NVIDIA VGX". I’ll have to learn how to use that, given that the P100 has no video output.