Abnormal Device ID

hk21 · April 13, 2022, 8:05am

I’m developing deep learning software that runs on a Windows PC with a single GPU.
After the software has been running for a while, it suddenly stops.
I found that cudaGetDevice returns an abnormal value once it has stopped.

Part of the device ID log obtained from cudaGetDevice:
2022/04/13 10:38:45 DeviceID:0
2022/04/13 10:39:45 DeviceID:0
2022/04/13 10:40:45 DeviceID:0
2022/04/13 10:41:45 DeviceID:0
2022/04/13 10:42:45 DeviceID:0
2022/04/13 10:43:45 DeviceID:32759
2022/04/13 10:44:45 DeviceID:32759
2022/04/13 10:45:45 DeviceID:32759
2022/04/13 10:46:45 DeviceID:32759

“DeviceID:0” is the normal value.

I also confirmed the following.
・Setting “CUDA_VISIBLE_DEVICES=0” did not help.
・The device ID is reset to 0 when the GPU driver is reinstalled.

I would appreciate it if someone could tell me the cause and how to prevent it.

[PC]
GPU : GeForce GTX 1650
Driver : 471.11
CUDA : 11.4
OS : Windows 10

[Deep learning software]
Language : C#
Inference Engine : ONNX Runtime 1.10 (built from source with CUDA)

Robert_Crovella · April 13, 2022, 12:27pm

My guess would be that you are doing something illegal on the GPU. Ordinarily, stopping the owning process and then restarting it would be enough to clear that issue. However, if the owning process gets stuck or doesn’t exit normally, it could result in a persistent bad state on the GPU.

Usually at that point, restarting the system will clear that up.

Another possibility is that the GPU is overheating or otherwise failing.

I would always recommend doing consistent, rigorous, proper CUDA error checking. My guess is that in the state where you get a bad DeviceID, the CUDA API call that retrieves the device ID (cudaGetDevice) is also failing. You should be checking for those errors and reporting them or handling them.
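
A minimal sketch of what such a check could look like around the device ID logging, assuming the logging is done from a native C++ layer. The names deviceLogLine, currentDeviceLogLine and kInvalidDevice are hypothetical and not taken from the original software. The idea is to initialize the output value to a sentinel and to log the error text instead of the ID whenever the call does not return success, since a failed call may leave the output variable untouched.

```cpp
#include <string>

// Hypothetical sentinel: written before the call so that a failed call
// can never leave a stale or uninitialized value in the log.
constexpr int kInvalidDevice = -1;

// Builds one log entry. The API call, the error-to-text function and the
// success code are passed in, so the checking logic itself does not depend
// on the CUDA headers.
template <typename Err, typename GetDevice, typename ErrToString>
std::string deviceLogLine(GetDevice getDevice, ErrToString errToString, Err success) {
    int device = kInvalidDevice;
    const Err status = getDevice(&device);
    if (status != success) {
        // Report the failure instead of printing whatever is in `device`.
        return std::string("cudaGetDevice failed: ") + errToString(status);
    }
    return "DeviceID:" + std::to_string(device);
}

// Wiring to the real CUDA runtime API, compiled only when the toolkit
// headers are available.
#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>

inline std::string currentDeviceLogLine() {
    return deviceLogLine(
        [](int* d) { return cudaGetDevice(d); },
        [](cudaError_t e) { return cudaGetErrorString(e); },
        cudaSuccess);
}
#endif
```

With this in place, a log entry would carry the error description reported by cudaGetErrorString whenever cudaGetDevice fails, which should help tell a failing API call apart from a genuinely changed device ID. How this would be exposed to a C# application is left open here.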

hk21 · April 13, 2022, 3:19pm

Dear Robert,

Thank you for your reply.

Actually, I’ve been testing this software at more than 10 sites since last April.
Each PC is located in a harsh environment, and I got this error at only one site.
Therefore, I think there is some possibility that the GPU is overheating or failing.
So I will check both this possibility and the CUDA errors.