deviceQuery reports: cudaGetDeviceCount returned 10 -> invalid device ordinal / test results... F

I have several compute nodes in a cluster, each with three M2050s, running NVIDIA driver 295.41 and CUDA Toolkit 4.2. Everything looks fine according to deviceQuery:

 # ./deviceQuery
 [deviceQuery] starting...

 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 Found 3 CUDA Capable device(s)

 Device 0: "Tesla M2050"
   CUDA Driver Version / Runtime Version          4.2 / 4.2
   CUDA Capability Major/Minor version number:    2.0
   Total amount of global memory:                 3072 MBytes (3220897792 bytes)
   (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
   GPU Clock rate:                                1147 MHz (1.15 GHz)
   Memory Clock rate:                             1546 Mhz
   Memory Bus Width:                              384-bit
   L2 Cache Size:                                 786432 bytes
 .
 .
 .

Eventually the compute nodes get into a state where both deviceQuery and deviceQueryDrv fail:

 # ./deviceQuery
 [deviceQuery] starting...

 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 cudaGetDeviceCount returned 10
 -> invalid device ordinal
 [deviceQuery] test results...
 FAILED

 > exiting in 3 seconds: 3...2...1...done!

 [root@pyt-c18 release]# ./deviceQueryDrv
 [deviceQueryDrv] starting...

 CUDA Device Query (Driver API) statically linked version
 cuInit(0) returned 101
 -> CUDA_ERROR_INVALID_VALUE
 [deviceQueryDrv] test results...
 FAILED

 > exiting in 3 seconds: 3...2...1...done!

We don’t know how the GPUs get into this state, and the only fix we have found is a reboot. Once the affected compute nodes are rebooted and come up fresh, deviceQuery reports everything fine again.

Does anyone know what causes this? And does anyone know whether we can reset the devices or the driver when they get into this state, so we can avoid rebooting our compute nodes all the time?

Check the GPU temperatures. It’s not the ‘falling off the bus’ message I’ve seen previously on the forums, but it might still be heat-related; if you can, get more airflow to the GPUs and see if the problem goes away.
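A quick way to watch the temperatures is nvidia-smi’s query mode. A sketch (assumes the nvidia-smi shipped with your 295.xx driver, and three GPUs per node as in your setup):

```shell
# Dump the thermal section of nvidia-smi's status report for each GPU.
# -q prints full device status; -d TEMPERATURE restricts it to the
# temperature section; -i selects the GPU by ordinal.
if command -v nvidia-smi >/dev/null 2>&1; then
    for id in 0 1 2; do
        echo "=== GPU $id ==="
        nvidia-smi -q -i "$id" -d TEMPERATURE
    done
else
    echo "nvidia-smi not found; skipping temperature check"
fi
```

Running that from cron and logging the output would tell you whether the failures correlate with a thermal spike.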

As for fixing it without a reboot, try a GPU reset with nvidia-smi — I believe that should be supported on the M2050: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML_cuda5/nvidia-smi.4.304.pdf
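A minimal reset sequence might look like this. Assumptions: the `--gpu-reset` flag from that era of nvidia-smi, a GPU with no display attached, and no compute process still holding the device (the reset is refused otherwise); the ordinal `0` is a placeholder for the affected GPU:

```shell
# Attempt to reset a wedged GPU without rebooting the node.
# --gpu-reset only works on secondary GPUs (no display attached) and
# fails if any process still holds the device.
GPU_ID=0   # placeholder: ordinal of the affected GPU on this node

if command -v nvidia-smi >/dev/null 2>&1; then
    # First show utilization, so any process still on the GPU is visible
    nvidia-smi -q -i "$GPU_ID" -d UTILIZATION
    # Then try the reset itself
    nvidia-smi --gpu-reset -i "$GPU_ID"
else
    echo "nvidia-smi not found; skipping reset"
fi
```

If the reset succeeds, re-run deviceQuery to confirm the device came back before dispatching jobs to the node.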

Also check out the Tesla Deployment Kit: https://developer.nvidia.com/tesla-deployment-kit
The Linux download includes a tool called nvidia-healthmon, which should be of use here.
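You could run it as a gate before handing a node to the batch system. A sketch, assuming nvidia-healthmon follows the usual convention of exiting nonzero when a check fails (verify against the README in the kit):

```shell
# Run nvidia-healthmon as a pre-job sanity check on a compute node.
# Assumption: nonzero exit status signals a failed GPU check.
if command -v nvidia-healthmon >/dev/null 2>&1; then
    if nvidia-healthmon; then
        echo "GPU health check passed"
    else
        echo "GPU health check FAILED"
        # hook point: mark the node offline in your batch system here
    fi
else
    echo "nvidia-healthmon not installed; skipping"
fi
```

Wiring that into a prologue script would let you drain a bad node automatically instead of discovering it when jobs start failing.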