I have several compute nodes in a cluster, each with three Tesla M2050s, running NVIDIA driver 295.41 and CUDA Toolkit 4.2. Everything looks fine according to deviceQuery:
# ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 3 CUDA Capable device(s)

Device 0: "Tesla M2050"
  CUDA Driver Version / Runtime Version          4.2 / 4.2
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 3072 MBytes (3220897792 bytes)
  (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
  GPU Clock rate:                                1147 MHz (1.15 GHz)
  Memory Clock rate:                             1546 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  ...
Eventually the compute nodes get into a state where both deviceQuery and deviceQueryDrv fail:
# ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED

> exiting in 3 seconds: 3...2...1...done!

[root@pyt-c18 release]# ./deviceQueryDrv
[deviceQueryDrv] starting...

CUDA Device Query (Driver API) statically linked version
cuInit(0) returned 101
-> CUDA_ERROR_INVALID_VALUE
[deviceQueryDrv] test results...
FAILED

> exiting in 3 seconds: 3...2...1...done!
We don’t know how the GPUs get into this state, and the only fix we have found is a reboot. Once the affected compute nodes come back up fresh, deviceQuery reports all three devices again.
Does anyone know what causes this? And is there a way to reset the devices or the driver when they get into this state, so we can avoid rebooting our compute nodes all the time?
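For what it’s worth, here is a sketch of the reset sequence we are considering trying before falling back to a reboot: kill anything holding the device files open, then unload and reload the NVIDIA kernel module. This is untested on our setup, requires root, and assumes no persistence daemon or X server is pinning the module; whether your nvidia-smi build supports --gpu-reset depends on the driver version, so check its help output first.

```shell
# Find any processes still holding the GPU device files open;
# the module cannot be unloaded while these exist.
lsof /dev/nvidia*

# If supported by this driver's nvidia-smi, try a per-GPU reset first
# (consult `nvidia-smi -h` -- availability varies by driver version).
nvidia-smi --gpu-reset -i 0

# Otherwise, unload and reload the kernel module to reinitialize all GPUs.
rmmod nvidia
modprobe nvidia
```

If rmmod reports the module is in use even after lsof shows nothing, a reboot may still be the only option.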