Need help with multi-GPU CUDA programming

Hello everyone, I hope someone can help me with the problem described below.

OS: Windows 10 (64-bit), IDE: Visual Studio

I have four GTX 1080 Ti cards installed, with device IDs 0, 1, 2, and 3.

For the same CUDA code, I use cudaSetDevice(deviceID) to assign the job to different GPUs. GPUs 0, 2, and 3 work perfectly. However, the GPU with deviceID=1 always runs into a problem, shown in the terminal as:

A long list of “CURAND_GENERATE() failed!” messages, followed by:
“CUFFT error: Plan creation failed”
“CUFFT error: ExecC2C failed”
“CUFFT error: Failed to synchronize”

Can someone tell me what is wrong with this GPU? Why does the same code work on GPUs 0, 2, and 3 but not on GPU 1?

I am really frustrated. This problem has troubled me for a long time. As far as I can tell, all four GPUs are set up identically.

Thank you, all!
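
A note on narrowing this down: the messages above do not show the status codes the libraries actually returned. Below is a minimal sketch of a per-device smoke test, assuming the program links against cuRAND and cuFFT. The buffer size, FFT length, file name, and build command are illustrative assumptions, not taken from the original code. It runs one small cuRAND call and one cuFFT plan creation on each device and prints the raw status values, which can then be looked up in curand.h and cufft.h.

```cuda
// Per-device smoke test: prints the numeric status codes that generic
// "failed!" messages hide.
// Assumed build command: nvcc smoke.cu -o smoke -lcurand -lcufft
#include <cstdio>
#include <cuda_runtime.h>
#include <curand.h>
#include <cufft.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int n = 1 << 20;      // arbitrary test size (assumption)
    const int fftLen = 1024;    // arbitrary FFT length (assumption)

    for (int dev = 0; dev < count; ++dev) {
        err = cudaSetDevice(dev);
        printf("device %d: cudaSetDevice -> %s\n", dev, cudaGetErrorString(err));
        if (err != cudaSuccess) continue;

        // Free/total memory: a device that reports very little free memory
        // could explain allocation failures inside the libraries.
        size_t freeB = 0, totalB = 0;
        err = cudaMemGetInfo(&freeB, &totalB);
        printf("  cudaMemGetInfo -> %s (free %zu MiB of %zu MiB)\n",
               cudaGetErrorString(err), freeB >> 20, totalB >> 20);

        float *buf = nullptr;
        err = cudaMalloc(&buf, n * sizeof(float));
        printf("  cudaMalloc -> %s\n", cudaGetErrorString(err));

        if (err == cudaSuccess) {
            // cuRAND: create a generator and fill the buffer once.
            curandGenerator_t gen;
            curandStatus_t cs =
                curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
            printf("  curandCreateGenerator -> %d\n", (int)cs);
            if (cs == CURAND_STATUS_SUCCESS) {
                cs = curandGenerateUniform(gen, buf, n);
                printf("  curandGenerateUniform -> %d\n", (int)cs);
                curandDestroyGenerator(gen);
            }
        }

        // cuFFT: plan creation only, since that is the first call reported
        // as failing.
        cufftHandle plan;
        cufftResult fr = cufftPlan1d(&plan, fftLen, CUFFT_C2C, 1);
        printf("  cufftPlan1d -> %d\n", (int)fr);
        if (fr == CUFFT_SUCCESS) cufftDestroy(plan);

        err = cudaDeviceSynchronize();
        printf("  cudaDeviceSynchronize -> %s\n", cudaGetErrorString(err));

        if (buf) cudaFree(buf);
    }
    return 0;
}
```

If device 1 already fails at cudaSetDevice, cudaMemGetInfo, or cudaMalloc, the cuRAND and cuFFT messages are probably only symptoms of an earlier failure.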

Power problem? How is the PSU specified? Single- or dual/multi-rail? How are the GPUs hooked up to the PSU?

Hello Tera,

Thank you for your reply. The PSU is rated at 1600 W. I think this is enough for four GPUs, right?

I don’t understand the single- vs. dual/multi-rail distinction you mention. Each GPU is connected to the PSU by a cable labeled “VGA”, which plugs into a port labeled “VGA” on the PSU.

I did check this GPU by using it to drive the monitor. It works fine, so why can’t CUDA code run on this GPU?

Best,

A marginal configuration, I would say. Rule of thumb: The total combined nominal wattage of all system components should be <= 60% of the PSU's rated output. Here, PSU = 1600 W, 60% of which is 960 W. Combined nominal wattage of system components: 4x GTX 1080 Ti @ 250 W, plus 100 W…150 W for the other components, = 1100 W…1150 W, thus > 960 W.

Make sure each GPU is connected with 6-pin plus 8-pin PCIe power connectors, with no Y-splitters or 6-to-8-pin converters. A PSU compliant with the 80 PLUS Platinum specification is recommended.

Your problem may not be electrical, but rather an issue of PCIe slot or memory aperture configurations (check system BIOS) or mechanical (GPU not properly seated in slot, not secured at bracket).

If you swap the GPUs through the slots in cyclical fashion, is the problem correlated with a specific GPU or a specific slot?
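
To make the swap test easier to interpret, it may help to record which PCI location each deviceID maps to before and after moving the cards, since CUDA device IDs are not guaranteed to follow physical slot order. Below is a minimal sketch, assuming only the CUDA runtime API. Which bus ID belongs to which physical slot still has to be worked out from the motherboard manual or by trial.

```cuda
// Lists each CUDA deviceID with its name and PCI bus ID, so a failing
// deviceID can be tied to a PCI location (and from there to a slot).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        char busId[32] = {0};   // filled as "domain:bus:device.function"

        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            printf("deviceID %d: cudaGetDeviceProperties -> %s\n",
                   dev, cudaGetErrorString(err));
            continue;
        }
        err = cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
        if (err != cudaSuccess) {
            printf("deviceID %d: cudaDeviceGetPCIBusId -> %s\n",
                   dev, cudaGetErrorString(err));
            continue;
        }
        printf("deviceID %d: %s, PCI bus ID %s\n", dev, prop.name, busId);
    }
    return 0;
}
```

If the failure stays with the same bus ID after the cards are rotated, the slot or its configuration is the likelier suspect. If it moves with the card, the card is.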

Have you disabled the WDDM timeout?
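
Whether a run time limit applies can be checked per device. Below is a minimal sketch using cudaDeviceGetAttribute. It only reports the current settings and does not change them.

```cuda
// Reports, for each device, whether kernels are subject to a run time
// limit (watchdog) and whether the device uses the TCC driver.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        int timeout = -1;  // 1 if a kernel run time limit is in effect
        int tcc = -1;      // 1 if the device is using the TCC driver
        cudaError_t e1 = cudaDeviceGetAttribute(
            &timeout, cudaDevAttrKernelExecTimeout, dev);
        cudaError_t e2 = cudaDeviceGetAttribute(
            &tcc, cudaDevAttrTccDriver, dev);
        if (e1 != cudaSuccess || e2 != cudaSuccess) {
            printf("deviceID %d: attribute query failed (%s / %s)\n",
                   dev, cudaGetErrorString(e1), cudaGetErrorString(e2));
            continue;
        }
        printf("deviceID %d: kernel run time limit %s, TCC driver %s\n",
               dev, timeout ? "enabled" : "disabled", tcc ? "yes" : "no");
    }
    return 0;
}
```

If only some of the devices report the limit as enabled, that difference is worth comparing against which deviceID fails.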