This is my code. When I use 8 GPUs, everything works fine, but when I use more than 8 GPUs, I get CUDA error 46.
Please don’t post pictures of code. Post the actual code as text, and use the formatting tools to format it properly. Thanks.
What is the output of nvidia-smi on that system?
Thanks for your reply. Here is the code and the output of nvidia-smi.
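#include <cuda_runtime.h>
#include <cstdio>
#include <iostream>
#include <thread>
using namespace std;

// Print the numeric code and the runtime's description for any failed CUDA call.
#define checkCUDA(call) \
    do { \
        cudaError_t status_ = (call); \
        if (status_ != cudaSuccess) { \
            std::cout << "Cuda failure: " << status_ << " (" \
                      << cudaGetErrorString(status_) << ")" << std::endl; \
        } \
    } while (0)

int main()
{
    // nDev must be a compile-time constant: C++ has no variable-length arrays,
    // and an array with a runtime size cannot take an initializer list.
    const int nDev = 9;
    int Devices[nDev] = {1, 2, 3, 4, 5, 10, 11, 12, 13};
    float* src[nDev];
    cudaStream_t stream[nDev];
    cudaEvent_t event[nDev];
    size_t size = 1024 * 1024;
    std::thread t[nDev];

    // Select each device in turn and create one stream on it.
    for (int i = 0; i < nDev; i++)
    {
        std::cout << i << std::endl;
        checkCUDA(cudaSetDevice(Devices[i]));
        checkCUDA(cudaStreamCreate(stream + i));
    }
}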
Fri Sep 17 22:56:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 PH402 SKU 200 Off | 00000000:1C:00.0 Off | N/A |
| N/A 38C P0 38W / 140W | 27419MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 PH402 SKU 200 Off | 00000000:1D:00.0 Off | N/A |
| N/A 33C P0 35W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 PH402 SKU 200 Off | 00000000:20:00.0 Off | N/A |
| N/A 44C P0 36W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 PH402 SKU 200 Off | 00000000:21:00.0 Off | N/A |
| N/A 38C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 PH402 SKU 200 Off | 00000000:24:00.0 Off | N/A |
| N/A 45C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 PH402 SKU 200 Off | 00000000:25:00.0 Off | N/A |
| N/A 39C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 PH402 SKU 200 Off | 00000000:28:00.0 Off | N/A |
| N/A 44C P0 40W / 140W | 26499MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 PH402 SKU 200 Off | 00000000:29:00.0 Off | N/A |
| N/A 38C P0 37W / 140W | 25607MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 PH402 SKU 200 Off | 00000000:2C:00.0 Off | N/A |
| N/A 45C P0 38W / 140W | 25607MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 PH402 SKU 200 Off | 00000000:2D:00.0 Off | N/A |
| N/A 41C P0 38W / 140W | 25607MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 PH402 SKU 200 Off | 00000000:62:00.0 Off | N/A |
| N/A 37C P0 36W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 PH402 SKU 200 Off | 00000000:63:00.0 Off | N/A |
| N/A 36C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 PH402 SKU 200 Off | 00000000:66:00.0 Off | N/A |
| N/A 43C P0 38W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 PH402 SKU 200 Off | 00000000:67:00.0 Off | N/A |
| N/A 38C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 PH402 SKU 200 Off | 00000000:6A:00.0 Off | N/A |
| N/A 46C P0 37W / 140W | 10MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 PH402 SKU 200 Off | 00000000:6B:00.0 Off | N/A |
| N/A 39C P0 38W / 140W | 31093MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 16 PH402 SKU 200 Off | 00000000:6E:00.0 Off | N/A |
| N/A 44C P0 40W / 140W | 0MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 17 PH402 SKU 200 Off | 00000000:6F:00.0 Off | N/A |
| N/A 38C P0 37W / 140W | 0MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 18 PH402 SKU 200 Off | 00000000:72:00.0 Off | N/A |
| N/A 44C P0 39W / 140W | 0MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 19 PH402 SKU 200 Off | 00000000:73:00.0 Off | N/A |
| N/A 36C P0 37W / 140W | 0MiB / 32630MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 57029 C python 27407MiB |
| 6 31358 C python3 26467MiB |
| 7 31358 C python3 25575MiB |
| 8 31358 C python3 25575MiB |
| 9 31358 C python3 25575MiB |
| 15 71417 C python 31083MiB |
+-----------------------------------------------------------------------------+
That’s a rather unusual GPU and system. The first thing I would do is update that system to the latest CUDA release, 11.4.2. If that does not resolve the issue, I would pursue it with your system vendor. An SBIOS update may be needed, and/or special SBIOS settings might be required to enable this configuration.
If anybody else is wondering: Apparently a “PH402 SKU 200” is a dual-GPU combo comprising two P100s, each with 32 GB. I am not aware that this was ever an official NVIDIA Tesla product. The name seems more indicative of a prototype that ultimately was not productized and thus is not supported by NVIDIA.
I recall past discussions in these forums with people successfully managing up to 14 GPUs in a system, but this system has 20. Each GPU will need a separate PCIe aperture. If the system BIOS does not support that many apertures, it might configure overlapping apertures for some of these GPUs, which could cause failures. There may be some evidence of the underlying issue in the system logs. Maybe try with seven “PH402 SKU 200” boards (so 14 GPUs) first to see whether that configuration runs properly?
You can find data points indicating support for 20 GPUs in certain cases. For example, the MLPerf v0.7 inference results in the datacenter-open category (be sure to click on “Open”) show submissions by both NVIDIA (on an SMC platform) and Inspur with 20 T4 GPUs. As already indicated, such a scenario requires support from a number of perspectives (NVIDIA GPU driver software, OS, and platform including SBIOS), so readers should not assume this is a trivial matter, automatic, or “guaranteed” in any way for an arbitrary system.