GPUs are stuck when training with multiple GPUs

Hi,

I bought 4 GPUs (three RTX 8000 and one Titan RTX) and one NVLink bridge for the RTX 8000s
There is no issue when using only one GPU, but the GPUs are stuck when using multiple GPUs, even any combination of two
When they are stuck, nvidia-smi doesn’t work, so the machine has to be rebooted (RAM memory doesn’t go up at the moment more than 10GB)

I have another server which has four Titan RTX GPUs, and it works well
I want to find out whether this is a hardware issue or not, so please share your experiences and recommendations

I have experimented several times with the combinations below to make them work, but all of them failed
The GPUs also get stuck regardless of whether I use TensorFlow or PyTorch

Environments
OS - Ubuntu 16.04 (64bit)
CPU - Ryzen Threadripper 3970X
GPU - three RTX 8000, one Titan RTX
RAM - 256GB

Nvidia driver versions

450.57 (run file)
450.80.02 (run file)
430.64 (run file)
430.64 (ubuntu-drivers autoinstall via http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub)
418.43 (run file)

Cuda versions
I used the run file for all of them

10.0
10.1.105
10.2.89

cuDNN versions
I used the run file for all of them

10.1-v7.6.5
10.1-v8.0.3
10.2-v7.6.5
10.2-v8.0.3

Regards,
Donghwan

“fail”, “stuck”, “doesn’t work” are not adequate problem descriptions. Fail how? Stuck how? Doesn’t work how? At minimum, please describe in more detail the following:

(1) GPUs are stuck
(2) nvidia-smi doesn’t work
(3) “RAM memory doesn’t go up at the moment more than 10GB”

What symptoms specifically are you observing? System doesn’t boot? System reboots spontaneously? No video output (screen stays black)? Error messages in system logs? What exactly is the issue with the (system?) RAM?

I note that this system has extremely high power requirements. The nominal power draw of a Quadro RTX 8000 is 295W, and it is 280W for a Titan RTX. The ThreadRipper 3970X CPU nominally draws 280W. The nominal wattage for the entire system with three RTX 8000 plus a Titan RTX is probably around 1600W. You would need PSUs (power supply units) with 2600W nominal wattage to run this rock solid under full load, and PSUs with about 2100W nominal to run this at all. What does the actual power supply situation look like? For a system like this I would definitely recommend 80PLUS Titanium rated PSUs.
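
For what it’s worth, here is that arithmetic as a rough sketch in Python. The per-component figures are the nominal TDPs cited above; the ~150W allowance for motherboard, RAM, drives, and fans is my own rough assumption.

```python
# Rough PSU sizing sketch. All figures are nominal (TDP-like) values;
# instantaneous draw of modern CPUs/GPUs can be significantly higher.
rtx_8000_w  = 295   # Quadro RTX 8000, nominal
titan_rtx_w = 280   # Titan RTX, nominal
cpu_w       = 280   # Threadripper 3970X, nominal
platform_w  = 150   # motherboard, RAM, drives, fans (assumed allowance)

system_nominal = 3 * rtx_8000_w + titan_rtx_w + cpu_w + platform_w
print(f"Nominal system draw:       ~{system_nominal} W")             # ~1600 W
print(f"PSU rating for rock solid: ~{system_nominal / 0.60:.0f} W")  # ~2650 W (60% rule)
print(f"PSU rating bare minimum:   ~{system_nominal / 0.75:.0f} W")  # ~2100 W
```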

This huge amount of power (basically, a space heater) used in a relatively small space also raises the question whether there is adequate cooling. What kind of enclosure (form factor) is used by the system? What does the cooling solution look like?


Hi njuffa,

Thank you for your reply.
I hope the following explains the situation well enough.

Power consumption
First of all, I have a PSU rated at 2000W (80 PLUS Platinum); the model name is SF-2000F14HP. The system is in a big tower case (width x height x depth: 30 x 70 x 70) with a good cooling setup.
But I don’t understand why I would need a PSU with a nominal wattage of 2600W. Is that an average value across the PSU certification levels such as 80 PLUS Platinum, Gold, Silver, and Bronze?
As you described, the recommended wattage for the entire system is around 1700W (checked with the OuterVision calculator).
Lastly, I have the same server with different GPUs (four Titan RTX) and coolers (liquid cooling), and it works well under full load.

Symptoms
I use only the command line interface (lightdm is stopped) and persistence mode is enabled (nvidia-smi -pm 1).
To clarify, the main issue is that the GPUs freeze and we can’t use multiple GPUs to train DNN models.
Actually, (1) and (2) are the same symptom and happen at the same time.
What I can say is that nvidia-smi doesn’t show anything when I use multiple GPUs to train the models,
and I can’t kill the training process, so I need to reboot if I want to use any GPU again.
*However, a process that I started earlier on a single GPU keeps running fine.
*Please tell me if there is a useful way to collect logs for this problem.

About (3), I think that was not useful information, so please ignore it.

Regards,
Donghwan

Re PSU sizing. The nominal wattage for electronic components is typically stated as TDP (thermal design power) or something essentially equivalent. This is power draw averaged over long periods of time (across several minutes). It is needed for determining the correct dimensions of the thermal solutions, thus the name. TDP doesn’t tell us anything about instantaneous power requirements (across a few milliseconds), which with modern CPUs and GPUs can be significantly higher than nominal. In my experience, a large safety factor is therefore needed for rock solid 24/7/365 operation under possibly changing environmental factors such as ambient temperature, so my standing recommendation is to make sure that nominal power for the system does not significantly exceed 60% of the nominal PSU wattage.

I usually suggest 80PLUS Titanium for servers with large power draw because it

(1) gives the greatest power supply “head room”, which can be important in environments where power is limited by the amperage of the circuit breaker. E.g. a typical residential electrical outlet in the US cannot supply more than 15A @ 120V. The more efficient the PSU, the higher the percentage of that power actually available to the system. Are you running the 2000W Super Flower PSU off 230V mains?

(2) converts the least amount of power to useless heat before it ever gets to the system, reducing electricity costs. With a system like yours operating around the clock, you are looking at 50kWh per day, so assuming 90% PSU efficiency, the intra-PSU loss is 5kWh per day, or (depending on electrical tariff) about $1.20 to $1.50 per day where I live.
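
As a back-of-the-envelope version of that calculation (the price per kWh here is only an illustrative assumption; substitute your local tariff):

```python
# Back-of-the-envelope PSU loss / electricity cost sketch.
system_load_w  = 2000   # draw from the wall, around the clock (rough)
psu_efficiency = 0.90
price_per_kwh  = 0.27   # assumed tariff in USD/kWh, purely illustrative

energy_per_day_kwh = system_load_w * 24 / 1000                  # ~48 kWh/day
psu_loss_kwh       = energy_per_day_kwh * (1 - psu_efficiency)  # ~5 kWh/day
print(f"Drawn from the wall per day: ~{energy_per_day_kwh:.0f} kWh")
print(f"Lost inside the PSU:         ~{psu_loss_kwh:.1f} kWh (~${psu_loss_kwh * price_per_kwh:.2f}/day)")
```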

Re symptoms: The “GPU freeze” scenario is not clear yet. Here is a hypothetical scenario: You boot the system, run nvidia-smi and it happily reports it can see four GPUs, all idling. Then you start your deep learning software, and some minutes into that the GPUs stop making forward progress. At this point you run nvidia-smi again, and it cannot see any of the GPUs. You check dmesg or syslog or /var/log/messages and see multiple error messages like NVRM: Xid (PCI:0000:06:00): 79, GPU has fallen off the bus. Is that what you are observing?
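
If you are not sure how to collect those logs, something along these lines will surface any NVRM / Xid messages from the kernel log (a minimal sketch; run it over SSH after the hang and before rebooting):

```python
# Minimal sketch: dump NVIDIA driver (NVRM) and Xid lines from the kernel log.
import subprocess

dmesg = subprocess.run(["dmesg", "-T"], stdout=subprocess.PIPE,
                       universal_newlines=True).stdout
for line in dmesg.splitlines():
    if "NVRM" in line or "Xid" in line:
        print(line)
```

Running nvidia-bug-report.sh (which ships with the driver) right after the hang also collects most of the relevant logs into a single archive.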

If so, the most likely root cause is inadequate power supply to the GPUs. There are only a few scenarios where a GPU can fall off the bus: (1) (poor quality) riser cards degrading PCIe signal quality (2) GPU not waking up correctly after a suspend-resume cycle (3) PCIe link operation negatively impacted by ACPI (4) defective GPU (5) insufficient power supplied to the GPU. All but the last one are rare.

Lastly, I have the same server with different GPUs (four Titan RTX) and coolers (liquid cooling), and it works well under full load.

If both machines were configured identically in all aspects, I agree that both should work when one works. “In all aspects” means “copy exactly”, down to cabling, BIOS versions, etc. Clearly, you do not have that here, as you have different GPUs in the two systems. They may have different requirements, e.g. for power and PCIe address space. One thing you can try is a cross check: exchange the GPUs between the working system and the problematic system and see what happens. I would also suggest you start with a small configuration with one GPU (which hopefully works) and work your way up to a four-GPU configuration.

You would want to work very methodically, carefully noting whether any issues correlate with a particular GPU, PCIe slot, or system.
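
To make that cross check repeatable, a short script along these lines can help pin down which GPU or which pair triggers the freeze (a minimal sketch, assuming PyTorch with CUDA support is installed; adjust matrix size and iteration count as needed):

```python
# Pairwise GPU smoke test: queue a matmul workload on two GPUs at a time and
# note which pair, if any, makes the system freeze.
import itertools
import torch

def queue_work(device, iters=200):
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    for _ in range(iters):
        a = a @ b            # kernel launches are asynchronous
    return a

num_gpus = torch.cuda.device_count()
for i, j in itertools.combinations(range(num_gpus), 2):
    print(f"Testing GPUs {i} and {j} ...", flush=True)
    queue_work(f"cuda:{i}")
    queue_work(f"cuda:{j}")
    torch.cuda.synchronize(i)   # wait for both GPUs to finish
    torch.cuda.synchronize(j)
    print(f"GPUs {i} and {j}: OK", flush=True)
```

If the script prints "Testing GPUs X and Y ..." and then nothing, you know which pair to look at; note which physical slots those indices map to (nvidia-smi shows the PCIe bus IDs).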


I checked out a review / test of the SF-2000F14HP. To their surprise, the testers managed to stably supply a system load of 2500W DC, causing the PSU to pull 2813W AC from the 230V mains, for an efficiency of 88.9%.

So it looks like your PSU may indeed have sufficient reserves to power this system. You might want to double check the cabling to ensure that all PCIe auxiliary power cables to the GPUs are connected correctly.
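
If you want to keep an eye on the power side while reproducing the hang, a small monitor along these lines logs per-GPU power draw once per second (a sketch assuming the pynvml bindings, pip package nvidia-ml-py, are installed; looping nvidia-smi --query-gpu=power.draw works just as well):

```python
# Sketch: log per-GPU power draw once per second while a multi-GPU job runs.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        watts = [pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0 for h in handles]  # mW -> W
        print("  ".join(f"GPU{i}: {w:6.1f} W" for i, w in enumerate(watts)), flush=True)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```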