Issue with more than 8 GPUs

I have a machine with two HIC cards that give it access to up to 16 GPUs. However, when I attempt to use more than 8 GPUs, deviceQuery returns the following:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Device Count = 0
cudaGetDeviceCount returned 10
-> invalid device ordinal
Result = FAIL

In particular, cudaGetDeviceCount populates its argument with 0.

Here are my GPUs according to lspci when I try 10 GPUs:

4c:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
4d:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
4e:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
53:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
55:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
63:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
65:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
69:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
6b:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
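
In the listing above, the first five devices report rev a1 and the last five report rev ff. A revision of ff often suggests that the device's PCI configuration space is not being read correctly, although that is an assumption here and not something confirmed in this thread. As a rough sketch, the lspci lines can be grouped by revision with a few lines of Python; the function name `group_by_revision` and the shortened sample are illustrative only.

```python
import re
from collections import defaultdict

# Four lines copied from the lspci listing above (shortened sample).
LSPCI_SAMPLE = """\
4c:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
55:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
6b:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev ff)
"""

# Matches "bus:device.function ... (rev xx)" as printed by lspci.
LINE_RE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*\(rev (?P<rev>[0-9a-f]{2})\)\s*$"
)


def group_by_revision(lspci_text):
    """Map each revision string to the list of PCI addresses reporting it."""
    groups = defaultdict(list)
    for line in lspci_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            groups[match.group("rev")].append(match.group("addr"))
    return dict(groups)


if __name__ == "__main__":
    for rev, addrs in sorted(group_by_revision(LSPCI_SAMPLE).items()):
        print("rev", rev, "->", ", ".join(addrs))
```

On a real machine the input would come from running lspci and filtering for the NVIDIA lines; any device landing in the ff group would be a candidate for a closer look at the PCIe link or BIOS resource allocation.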

The OS is CentOS 6.4 64bit and the CUDA version is 5.5.

Any suggestions?

Hi,
Which driver did you install? Are you connecting an S-series Tesla through the HICs?

Could you run deviceQueryDrv, which uses the CUDA driver API?

I don’t know what you mean by an S-series Tesla.

# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  319.37  Wed Jul  3 17:08:50 PDT 2013
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
# ./deviceQueryDrv 
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version 
cuInit(0) returned 101
-> CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)
Result = FAIL

How are your GPUs connected to the HICs? AFAIK, a HIC connects to an NVIDIA S-series system, such as the Tesla S1070 or S2050.

I have a Dell C410x chassis and two NVIDIA P797 HIC cards. It works well with up to 8 GPGPUs, but not with more than that.

Hi,
Could you please try driver 319.49 (the "Linux x64 (AMD64/EM64T) Display Driver", Linux 64-bit, from NVIDIA's download page)? If the problem still persists, I suggest you file a bug through the NVIDIA CUDA Registered Developer Program.

Log in at https://developer.nvidia.com/user/login

Click the link “CUDA/GPU Computing Registered Developer Program”
Click the link “The Submit a Bug Form”
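
To check whether the installed driver is older than the suggested 319.49, the version can be pulled out of the NVRM line shown earlier in /proc/driver/nvidia/version. This is only a minimal sketch: the helper name `parse_nvrm_version` is hypothetical, and the regular expression assumes the line keeps the "Kernel Module  <version>" layout seen in this thread.

```python
import re

# NVRM line copied from the /proc/driver/nvidia/version output in this thread.
SAMPLE = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  319.37  "
    "Wed Jul  3 17:08:50 PDT 2013"
)


def parse_nvrm_version(text):
    """Return the driver version as a tuple of ints, or None if not found."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    installed = parse_nvrm_version(SAMPLE)
    suggested = (319, 49)
    # Tuples compare element by element, so (319, 37) sorts before (319, 49).
    if installed is not None and installed < suggested:
        print("installed driver is older than the suggested version")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly, where "319.9" would sort after "319.49".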

Thanks for your help ryluo. I just submitted an application for the CUDA/GPU Computing Registered Developer Program, so hopefully that goes through and then I will submit a bug report.

I recently built a Linux box with 16 NVIDIA GPUs, and all of them can be accessed without problems.
My CUDA version is also 5.5, with the 319.37 Linux 64-bit driver, so the driver itself should not be what limits you to 8 GPUs.
Has anyone tested with more than 16 GPUs before? I’d like to try it when I get more dual-GPU cards and see whether any issues appear beyond 16 GPUs.

Update: 18 GPUs in a single rig have been proven to work! See https://devtalk.nvidia.com/default/topic/649542/cuda-setup-and-installation/18-gpus-in-a-single-rig-and-it-works/

deviceQuery result:

root@server:~# deviceQuery | head -n 4
deviceQuery Starting...

CUDA Device Query (Driver API) statically linked version
Detected 16 CUDA Capable device(s)

nvidia-smi result:

+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660 Ti  On   | 0000:01:00.0     N/A |                  N/A |
| 30%   29C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 295     On   | 0000:04:00.0     N/A |                  N/A |
| N/A   50C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 295     On   | 0000:05:00.0     N/A |                  N/A |
| 41%   48C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 295     On   | 0000:08:00.0     N/A |                  N/A |
| N/A   49C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 295     On   | 0000:09:00.0     N/A |                  N/A |
| 41%   48C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 660 Ti  On   | 0000:0A:00.0     N/A |                  N/A |
| 30%   31C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 660 Ti  On   | 0000:0B:00.0     N/A |                  N/A |
| 30%   32C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 295     On   | 0000:83:00.0     N/A |                  N/A |
| N/A   51C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   8  GeForce GTX 295     On   | 0000:84:00.0     N/A |                  N/A |
| 41%   49C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   9  GeForce GTX 660 Ti  On   | 0000:85:00.0     N/A |                  N/A |
| 30%   29C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  10  GeForce GTX 295     On   | 0000:88:00.0     N/A |                  N/A |
| N/A   54C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  11  GeForce GTX 295     On   | 0000:89:00.0     N/A |                  N/A |
| 41%   52C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  12  GeForce GTX 295     On   | 0000:8C:00.0     N/A |                  N/A |
| N/A   53C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  13  GeForce GTX 295     On   | 0000:8D:00.0     N/A |                  N/A |
| 41%   51C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  14  GeForce GTX 660 Ti  On   | 0000:8E:00.0     N/A |                  N/A |
| 30%   33C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  15  GeForce GTX 660 Ti  On   | 0000:8F:00.0     N/A |                  N/A |
| 30%   32C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
|    4            Not Supported                                               |
|    5            Not Supported                                               |
|    6            Not Supported                                               |
|    7            Not Supported                                               |
|    8            Not Supported                                               |
|    9            Not Supported                                               |
|   10            Not Supported                                               |
|   11            Not Supported                                               |
|   12            Not Supported                                               |
|   13            Not Supported                                               |
|   14            Not Supported                                               |
|   15            Not Supported                                               |
+-----------------------------------------------------------------------------+
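
For a listing this long, it can help to tally the GPUs by model and confirm the total matches what deviceQuery reports. The following is a rough sketch that assumes the row layout of the 319.37-era nvidia-smi table shown above (index, name, persistence mode, bus ID); the names `ROW_RE` and `tally_gpus` are illustrative, and other nvidia-smi versions may format rows differently.

```python
import re
from collections import Counter

# A few rows copied from the nvidia-smi table above (shortened sample).
SMI_SAMPLE = """\
|   0  GeForce GTX 660 Ti  On   | 0000:01:00.0     N/A |                  N/A |
| 30%   29C  N/A     N/A /  N/A |        7MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 295     On   | 0000:04:00.0     N/A |                  N/A |
| N/A   50C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|  10  GeForce GTX 295     On   | 0000:88:00.0     N/A |                  N/A |
| N/A   54C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
"""

# Matches only the first line of each GPU entry: index, name, On/Off, bus ID.
ROW_RE = re.compile(
    r"^\|\s+(\d+)\s+(.+?)\s+(On|Off)\s+\|\s+"
    r"([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s"
)


def tally_gpus(smi_text):
    """Count GPUs per model name found in nvidia-smi table text."""
    counts = Counter()
    for line in smi_text.splitlines():
        match = ROW_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    counts = tally_gpus(SMI_SAMPLE)
    for name, n in sorted(counts.items()):
        print(name, n)
    print("total", sum(counts.values()))
```

Feeding the full table through the same function gives a quick cross-check against the "Detected 16 CUDA Capable device(s)" line.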