I have a machine with two HIC cards that allow it to access 16 GPUs. However, when I attempt to go above 8 GPUs, deviceQuery returns the following:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Device Count = 0
cudaGetDeviceCount returned 10
-> invalid device ordinal
Result = FAIL
In particular, cudaGetDeviceCount populates its argument with 0.
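For reference, a minimal standalone reproducer of that check (essentially what deviceQuery does before enumerating devices) is sketched below; it only assumes the standard CUDA runtime API and should be compiled with nvcc:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    // cudaGetDeviceCount writes the number of usable devices into `count`.
    // In the failing case above it sets count to 0 and returns error code 10
    // (cudaErrorInvalidDevice, "invalid device ordinal").
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount returned %d\n-> %s\n",
                (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("Device Count = %d\n", count);
    return 0;
}
```

On a healthy system this prints the device count; on the machine above it reproduces the `returned 10 -> invalid device ordinal` failure.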
Here are my GPUs according to lspci when I try 10 of them (note that the last five report rev ff, which usually means the device did not respond to PCI configuration reads):
4c:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
4d:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
4e:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
53:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
55:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
63:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
65:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
69:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
6b:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
The OS is CentOS 6.4 64-bit and the CUDA version is 5.5.
# ./deviceQueryDrv
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
cuInit(0) returned 101
-> CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)
Result = FAIL
Thanks for your help, ryluo. I just submitted an application for the CUDA/GPU Computing Registered Developer Program, so hopefully that goes through and then I will submit a bug report.
I recently built a Linux box with 16 NVIDIA GPUs, and all of them can be accessed without problems.
My CUDA version is also 5.5, with the 319.37 64-bit Linux driver, so there should be no driver issue with more than 8 GPUs.
I'm also wondering whether anyone has tested with more than 16 GPUs before. I'd like to try it once I get more dual-GPU cards and see whether any issues appear beyond 16 GPUs.