I have been put in charge of a small shared GPU server with 4 Titan Vs and a Tesla V100. One of my users complained of slow performance, so I looked into it. Using a CIFAR-10 tutorial as a benchmark (https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py), I found that it ran at almost 60ms/step, compared to 10ms/step on my personal GTX 1080. I found the advice about unlocking the clock cap and applied it, which got us to ~30-40ms/step, but this is still unreasonably slow. I have run the tests both in the Docker containers that we use for users and on the bare-metal host. For reference, we are running Ubuntu 18.04.1 LTS and driver 415.27 with CUDA 10.
Is this a system you bought from an integrator with GPUs already installed, or did you put this system together yourself? If the latter, what is the wattage of the power supply on this machine? The combined load from the GPUs alone is 1250W, and the whole system is likely 1400+W. For rock-solid operation, my rule of thumb is that total nominal power draw should not exceed 60% of the nominal power supply wattage, so ideally you would have 2400W provided by the power supply. If you want to cut it close, which I don’t recommend, 2000W would be the minimum.
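The sizing arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the 250W per-GPU figure is an assumption chosen because it reproduces the 1250W total quoted above.

```python
def recommended_psu_watts(nominal_load_w, max_load_fraction=0.60):
    """PSU rating needed so the nominal load stays at or below the given fraction."""
    if not 0 < max_load_fraction <= 1:
        raise ValueError("max_load_fraction must be in (0, 1]")
    return nominal_load_w / max_load_fraction


# Assumption: 250 W nominal board power for each of the 5 GPUs.
gpu_load_w = 5 * 250
# Whole-system estimate taken from the post (GPUs plus CPUs, memory, drives, fans).
system_load_w = 1400

print("GPU load:", gpu_load_w, "W")
print("PSU at 60% loading:", round(recommended_psu_watts(system_load_w, 0.60)), "W")
print("PSU at 70% loading:", round(recommended_psu_watts(system_load_w, 0.70)), "W")
```

The 60% case comes out slightly above 2300W, which rounds up to the 2400W mentioned above; the 70% case gives the 2000W "cutting it close" figure.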
Have you checked with nvidia-smi whether the GPUs run at the expected clock speeds under benchmark load?
Does nvidia-smi show thermal capping or power capping during the benchmark run (“Clocks Throttle Reasons”)? If power capping occurs, try raising the power cap limit to the maximum allowed (“Max Power Limit”). Note that this can lead to higher GPU power demands on the power supply; make sure there are adequate reserves per my rule of thumb above. If thermal capping occurs, your server enclosure provides inadequate cooling; fix that. You could also try raising the thermal cap limit, but running electronics hot will reduce their lifetime.
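One way to watch for this while the benchmark runs is to poll nvidia-smi in CSV mode and flag any throttle reason that is active. The following is a rough sketch, not a tested tool: the field names are the ones `nvidia-smi --help-query-gpu` lists on the drivers I am familiar with and may differ on yours, the helper names are made up, and the sample line is hypothetical rather than output from the system discussed here.

```python
import csv
import io
import subprocess

# Query fields; check `nvidia-smi --help-query-gpu` for what your driver supports.
FIELDS = [
    "index",
    "clocks.sm",
    "clocks.max.sm",
    "power.draw",
    "power.limit",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_status(csv_text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if values:
            rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def active_throttle_reasons(row):
    """Names of the throttle-reason fields that report 'Active' for this GPU."""
    return [
        name for name in FIELDS
        if name.startswith("clocks_throttle_reasons.") and row[name] == "Active"
    ]


def query_status():
    """Run nvidia-smi once and parse the result (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_status(subprocess.check_output(cmd, text=True))


# Hypothetical sample line, for illustration only (not measured data).
SAMPLE = "0, 1335, 1912, 248.10, 250.00, Active, Not Active, Not Active\n"
for gpu in parse_status(SAMPLE):
    print(gpu["index"], gpu["clocks.sm"], active_throttle_reasons(gpu))
```

On the actual server you would call `query_status()` in a loop during the benchmark instead of parsing `SAMPLE`.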
I don’t know what the CIFAR-10 tutorial is, or what its performance characteristics are. Does it require much CPU/GPU communication? Ideally, each GPU should use a PCIe gen3 x16 link for optimal speed. Given that there are 5 GPUs in this system, and most CPUs don’t offer more than 40 PCIe lanes, I am guessing this may not be the case. How are the PCIe links configured?
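The lane budget behind that guess is simple arithmetic, sketched below. The 40-lanes-per-CPU figure is the one from the post; actual lane counts vary by CPU model, and the function names are made up for illustration.

```python
import math


def lanes_required(num_gpus, lanes_per_gpu=16):
    """PCIe lanes needed to give every GPU a full-width link."""
    return num_gpus * lanes_per_gpu


def cpus_required(num_gpus, lanes_per_cpu=40, lanes_per_gpu=16):
    """Minimum number of CPU sockets whose lanes cover all GPUs (ignores other devices)."""
    return math.ceil(lanes_required(num_gpus, lanes_per_gpu) / lanes_per_cpu)


print("lanes needed for 5 GPUs:", lanes_required(5))
print("CPUs needed at 40 lanes each:", cpus_required(5))
```

A single 40-lane CPU cannot feed five x16 links directly, which is why the link configuration is worth checking. As far as I know, nvidia-smi can report the negotiated link via the `pcie.link.gen.current` and `pcie.link.width.current` query fields.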
What kind of CPU is in this system, and how much system memory does it have? GPU-accelerated systems can also suffer from an underpowered CPU, since applications will have serial portions executing on the host CPU. Ideally, you want about 4 CPU cores per high-end GPU, with as high clock frequency and thus single-thread performance as you can get. As for system memory, you would want as many channels of DDR4-2666 as you can get, and system memory size is ideally around 4x the total amount of GPU memory (in this case, 256GB of system memory, for example).
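Applied to this particular box, those two guidelines work out as follows. This is a sketch with made-up names; the per-GPU memory sizes are assumptions (12GB per Titan V, and the 16GB variant of the V100, which is what makes the total match the 256GB figure above).

```python
# Assumed memory per GPU in GB; the V100 also exists in a 32 GB variant.
GPU_MEMORY_GB = {"Titan V": 12, "Tesla V100": 16}

gpus = ["Titan V"] * 4 + ["Tesla V100"]


def recommended_cpu_cores(num_gpus, cores_per_gpu=4):
    """About 4 CPU cores per high-end GPU."""
    return num_gpus * cores_per_gpu


def recommended_system_memory_gb(gpu_names, factor=4):
    """System memory of roughly `factor` times the total GPU memory."""
    return factor * sum(GPU_MEMORY_GB[name] for name in gpu_names)


print("CPU cores:", recommended_cpu_cores(len(gpus)))
print("system memory (GB):", recommended_system_memory_gb(gpus))
```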
Bought from an SI, got 3x 1600W if I remember correctly although one is for redundancy. I would question your rule of thumb, considering peak efficiency is usually found in the 70-80% range but either way, we’re well below.
As I said I have fixed the clock capping issue that is there by default, cards clock up no issue and maintain the proper power state without thermal throttling.
Everything is on PCIe gen3 x16; two Xeons make that no problem. Deep learning applications, in general, don’t need much CPU-GPU communication after setup.
192GB of DDR4-2400 I believe, more than enough.
I really don’t think this is an external bottleneck. I am comparing to my personal machine, which has less RAM running at the same speed, fewer cores (which never boost above the Xeon cores’ speeds), and a weaker GPU. The server shows no signs of throttling, and the slowdown occurs regardless of which GPUs I use and what CPU affinity I set.
I don’t have any other ideas. Maybe look into the software configuration; something may be misconfigured there. Generally, trying to diagnose issues like this on a system I can’t touch, with software I can’t run and don’t know, is like a car mechanic trying to diagnose car trouble over the phone. Doesn’t work too well. My experience in this forum is that most people with weird issues have a home-brew system, which motivates my standard canon of questions. Various system integrators have in-house knowledge of configuring their systems for optimal deep-learning performance, so you may want to bring up the performance issue with your vendor.
This rule of thumb is not driven by efficiency, but by a tendency of CPUs and GPUs to experience short-term (millisecond) peaks in power consumption well beyond their design power (which is based on power draw averaged over longer periods of time, hence the name TDP = thermal design power). Peak usage 25% over TDP is not unusual. If the PSU does not have reserves, this will cause undervoltage (brown-outs), with possible outcomes including (1) the system’s PWRGOOD signal says “bad power” and the machine performs a hard reset, or (2) processor components slow down, leading to incorrect data (e.g. due to violation of timing requirements of flip-flops).
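To see why the reserve matters, here is the arithmetic for how close a transient peak gets to the PSU rating at different nominal loadings. It is a simplified sketch: it assumes the whole system peaks 25% above its nominal draw at the same instant, which is a pessimistic simplification of the per-component behavior described above, and the function name is made up.

```python
def peak_fraction_of_psu(nominal_load_fraction, peak_over_nominal=0.25):
    """Fraction of the PSU rating reached during a transient peak."""
    return nominal_load_fraction * (1 + peak_over_nominal)


for nominal in (0.60, 0.70, 0.80):
    peak = peak_fraction_of_psu(nominal)
    print(f"{nominal:.0%} nominal loading -> {peak:.1%} of PSU rating at peak")
```

Under this assumption, a system loaded to 80% of the PSU rating reaches the full rating during a peak, while one loaded to 60% still has a quarter of the rating in reserve.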
My experience is: when running a system at 80% of PSU wattage, hard reboots frequently accompany the start of a CUDA application, as both CPUs and GPUs kick in. 60% is certainly conservative, but it will give you a rock-solid system, and PSU prices increase only modestly as wattage goes up. You may get by with pushing to 70% of nominal PSU wattage. I would recommend 80 PLUS Platinum rated PSUs for a server, and consider 80 PLUS Titanium ideal (but possibly too pricey; it depends on usage pattern and local cost of electricity, which is high where I live [California, 21ct/kWh]).
As for efficiency, there are differences between 115V operation in the US and 230V operation in Europe. I am in the US, and looking at published efficiency data for PSUs running off 115V, one notes that the region of maximum efficiency typically starts at 15%-20% and ends at 60%-65% of nominal wattage. With the higher input voltage, the region of maximum efficiency extends into higher percentage ranges. A random example showing the two efficiency curves side by side (for a 1600W 80 PLUS Titanium PSU): https://www.tomshardware.co.uk/evga-supernova-1600-t2-psu,review-34200-5.html