deviceQuery reports: cudaGetDeviceCount returned 10 -> invalid device ordinal / test results... F

I have several compute nodes in a cluster, each with three M2050s, running NVIDIA driver 295.41 and CUDA Toolkit 4.2. Everything looks fine according to deviceQuery:

 # ./deviceQuery
 [deviceQuery] starting...

 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 Found 3 CUDA Capable device(s)

 Device 0: "Tesla M2050"
   CUDA Driver Version / Runtime Version          4.2 / 4.2
   CUDA Capability Major/Minor version number:    2.0
   Total amount of global memory:                 3072 MBytes (3220897792 bytes)
   (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
   GPU Clock rate:                                1147 MHz (1.15 GHz)
   Memory Clock rate:                             1546 Mhz
   Memory Bus Width:                              384-bit
   L2 Cache Size:                                 786432 bytes
 .
 .
 .

Eventually the compute nodes get into a state where both deviceQuery and deviceQueryDrv fail:

 # ./deviceQuery
 [deviceQuery] starting...

 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 cudaGetDeviceCount returned 10
 -> invalid device ordinal
 [deviceQuery] test results...
 FAILED

 > exiting in 3 seconds: 3...2...1...done!

 [root@pyt-c18 release]# ./deviceQueryDrv
 [deviceQueryDrv] starting...

 CUDA Device Query (Driver API) statically linked version
 cuInit(0) returned 101
 -> CUDA_ERROR_INVALID_VALUE
 [deviceQueryDrv] test results...
 FAILED

 > exiting in 3 seconds: 3...2...1...done!

We don’t know how the GPUs get into this state, and the only fix we have found is a reboot. Once the affected compute nodes are rebooted and come up fresh, deviceQuery reports everything fine again.

Does anyone know what causes this? And does anyone know whether we can reset the devices or the driver when they get into this state, so we can avoid rebooting our compute nodes all the time?

Check the GPU temperatures. It’s not the ‘falling off the bus’ message I’ve seen previously on the forums, but it might still be heat-related; if you can, get more airflow to the GPUs and see if the problem goes away.
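A quick way to watch the temperatures is nvidia-smi’s query mode. A sketch (assumes the nvidia-smi shipped with your 295.xx driver, and three GPUs per node as in your setup):

```shell
# Dump the thermal section of nvidia-smi's status report for each GPU.
# -q prints full device status; -d TEMPERATURE restricts it to the
# temperature section; -i selects the GPU by ordinal.
if command -v nvidia-smi >/dev/null 2>&1; then
    for id in 0 1 2; do
        echo "=== GPU $id ==="
        nvidia-smi -q -i "$id" -d TEMPERATURE
    done
else
    echo "nvidia-smi not found; skipping temperature check"
fi
```

Running that from cron and logging the output would tell you whether the failures correlate with a thermal spike.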

As for fixing it without a reboot, try a GPU reset with nvidia-smi — I believe that should be supported on the M2050: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML_cuda5/nvidia-smi.4.304.pdf
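A minimal reset sequence might look like this. Assumptions: the `--gpu-reset` flag from that era of nvidia-smi, a GPU with no display attached, and no compute process still holding the device (the reset is refused otherwise); the ordinal `0` is a placeholder for the affected GPU:

```shell
# Attempt to reset a wedged GPU without rebooting the node.
# --gpu-reset only works on secondary GPUs (no display attached) and
# fails if any process still holds the device.
GPU_ID=0   # placeholder: ordinal of the affected GPU on this node

if command -v nvidia-smi >/dev/null 2>&1; then
    # First show utilization, so any process still on the GPU is visible
    nvidia-smi -q -i "$GPU_ID" -d UTILIZATION
    # Then try the reset itself
    nvidia-smi --gpu-reset -i "$GPU_ID"
else
    echo "nvidia-smi not found; skipping reset"
fi
```

If the reset succeeds, re-run deviceQuery to confirm the device came back before dispatching jobs to the node.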

Also check out the Tesla Deployment Kit: https://developer.nvidia.com/tesla-deployment-kit
The Linux download includes a tool called nvidia-healthmon, which should be of use here.
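You could run it as a gate before handing a node to the batch system. A sketch, assuming nvidia-healthmon follows the usual convention of exiting nonzero when a check fails (verify against the README in the kit):

```shell
# Run nvidia-healthmon as a pre-job sanity check on a compute node.
# Assumption: nonzero exit status signals a failed GPU check.
if command -v nvidia-healthmon >/dev/null 2>&1; then
    if nvidia-healthmon; then
        echo "GPU health check passed"
    else
        echo "GPU health check FAILED"
        # hook point: mark the node offline in your batch system here
    fi
else
    echo "nvidia-healthmon not installed; skipping"
fi
```

Wiring that into a prologue script would let you drain a bad node automatically instead of discovering it when jobs start failing.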