Is there a limit on the maximum number of GPUs?

This is my code. With 8 GPUs everything works fine, but when I use more than 8 GPUs, CUDA error 46 is raised.

Please don’t post pictures of code. Post the actual code as text, and use the formatting tools to format it properly. Thanks.

What is the output of nvidia-smi on that system?

Thanks for your reply. Here are the code and the output of nvidia-smi.

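#include <cuda_runtime.h>
#include <iostream>

// Report any CUDA runtime error with its numeric code, name, and description.
#define checkCUDA(call)                                                  \
    do {                                                                 \
        cudaError_t status = (call);                                     \
        if (status != cudaSuccess) {                                     \
            std::cout << "Cuda failure: " << status << " ("              \
                      << cudaGetErrorName(status) << ": "                \
                      << cudaGetErrorString(status) << ")" << std::endl; \
        }                                                                \
    } while (0)

int main()
{
    // Device indices to use; the array size must be a compile-time constant.
    constexpr int nDev = 9;
    const int Devices[nDev] = {1, 2, 3, 4, 5, 10, 11, 12, 13};

    cudaStream_t stream[nDev];

    for (int i = 0; i < nDev; i++)
    {
        std::cout << i << std::endl;
        // Creating a stream forces context creation on the selected device.
        checkCUDA(cudaSetDevice(Devices[i]));
        checkCUDA(cudaStreamCreate(stream + i));
    }

    return 0;
}
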
Fri Sep 17 22:56:48 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  PH402 SKU 200       Off  | 00000000:1C:00.0 Off |                  N/A |
| N/A   38C    P0    38W / 140W |  27419MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  PH402 SKU 200       Off  | 00000000:1D:00.0 Off |                  N/A |
| N/A   33C    P0    35W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  PH402 SKU 200       Off  | 00000000:20:00.0 Off |                  N/A |
| N/A   44C    P0    36W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  PH402 SKU 200       Off  | 00000000:21:00.0 Off |                  N/A |
| N/A   38C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  PH402 SKU 200       Off  | 00000000:24:00.0 Off |                  N/A |
| N/A   45C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  PH402 SKU 200       Off  | 00000000:25:00.0 Off |                  N/A |
| N/A   39C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  PH402 SKU 200       Off  | 00000000:28:00.0 Off |                  N/A |
| N/A   44C    P0    40W / 140W |  26499MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  PH402 SKU 200       Off  | 00000000:29:00.0 Off |                  N/A |
| N/A   38C    P0    37W / 140W |  25607MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  PH402 SKU 200       Off  | 00000000:2C:00.0 Off |                  N/A |
| N/A   45C    P0    38W / 140W |  25607MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  PH402 SKU 200       Off  | 00000000:2D:00.0 Off |                  N/A |
| N/A   41C    P0    38W / 140W |  25607MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  PH402 SKU 200       Off  | 00000000:62:00.0 Off |                  N/A |
| N/A   37C    P0    36W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  PH402 SKU 200       Off  | 00000000:63:00.0 Off |                  N/A |
| N/A   36C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  PH402 SKU 200       Off  | 00000000:66:00.0 Off |                  N/A |
| N/A   43C    P0    38W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  PH402 SKU 200       Off  | 00000000:67:00.0 Off |                  N/A |
| N/A   38C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  PH402 SKU 200       Off  | 00000000:6A:00.0 Off |                  N/A |
| N/A   46C    P0    37W / 140W |     10MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  PH402 SKU 200       Off  | 00000000:6B:00.0 Off |                  N/A |
| N/A   39C    P0    38W / 140W |  31093MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  16  PH402 SKU 200       Off  | 00000000:6E:00.0 Off |                  N/A |
| N/A   44C    P0    40W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  17  PH402 SKU 200       Off  | 00000000:6F:00.0 Off |                  N/A |
| N/A   38C    P0    37W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  18  PH402 SKU 200       Off  | 00000000:72:00.0 Off |                  N/A |
| N/A   44C    P0    39W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  19  PH402 SKU 200       Off  | 00000000:73:00.0 Off |                  N/A |
| N/A   36C    P0    37W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     57029      C   python                                     27407MiB |
|    6     31358      C   python3                                    26467MiB |
|    7     31358      C   python3                                    25575MiB |
|    8     31358      C   python3                                    25575MiB |
|    9     31358      C   python3                                    25575MiB |
|   15     71417      C   python                                     31083MiB |
+-----------------------------------------------------------------------------+

That’s a rather unusual GPU and system. The first thing I would do is update that system to the latest CUDA release (11.4.2 at the time of writing). If that does not resolve the issue, I would pursue it with your system vendor. An SBIOS update may be needed, and/or special SBIOS settings might be required to enable this configuration.

If anybody else is wondering: apparently a “PH402 SKU 200” is a dual-GPU combo comprising two P100s, each with 32 GB. I am not aware that this was ever an official NVIDIA Tesla product. The name seems more indicative of a prototype that was ultimately not productized and thus is not supported by NVIDIA.

I recall past discussions in these forums with people successfully managing up to 14 GPUs in a system, but this system has 20. Each GPU needs a separate PCIe aperture. If the system BIOS does not support that many apertures, it might configure overlapping apertures for some of these GPUs, causing failures. The system logs may contain some evidence of the underlying issue. Maybe try with seven “PH402 SKU 200” units (so 14 GPUs) first to see whether that configuration runs properly?


You can find datapoints indicating support for 20 GPUs in certain cases. For example, the MLPerf v0.7 inference results in the datacenter-open category (be sure to click on “Open”) show submissions by both NVIDIA (on an SMC platform) and Inspur with 20 T4 GPUs. As already indicated, such a scenario requires support from a number of perspectives (NVIDIA GPU driver software, OS, and platform including SBIOS), so readers should not assume this is a trivial matter, automatic, or “guaranteed” in any way for an arbitrary system.