Hardware setup for a multi-GPU RTX 30xx-based system

Hello CUDA experts,

I’d like to brainstorm a cost-efficient do-it-yourself setup for a multi-GPU RTX 30xx system (motherboard, CPU, RAM, power, etc.). Is this the appropriate forum for such a discussion? If not, where does this discussion belong? Any suggestions?

Thanks!

Given it’s early days, and only those fortunate enough to get hold of RTX 30xx hardware have had a chance to use it, there probably isn’t widespread experience yet.

One worthwhile place to get some idea of working hardware is Puget Systems:

There is a wealth of real-world testing there, including the exact hardware used. In particular, check out this one:

As you say you want to use multiple cards, PSU and cooling are critical, and it seems Gigabyte is one of the few vendors offering blower-style cards that exhaust hot air directly out of the case.

The HPC blog may be worth checking out also:

I have no experience with Puget and am not trying to push any barrows here.

My assumption is that you’ll be building a system primarily for heavy compute workloads that you intend to use for five years. Here are some common issues when configuring a high-end GPU-accelerated system. This is generic advice; I haven’t used any Ampere-class GPU.

  • Underpowered CPU. GPUs help you accelerate the parallelizable portion of a workload. Assuming a well-parallelizable task, high-end GPUs speed up that part a lot. That often turns the serial portion, running on the CPU, into a bottleneck. You want a CPU with high single-thread performance, which to first order means a CPU with high clock frequency. I strongly recommend >= 3.5 GHz base frequency for the CPU. For the vast majority of use cases, no more than 4 CPU cores per high-end GPU are needed for a well-balanced system.

  • Insufficient system memory. A GPU-accelerated system often needs more system memory than an unaccelerated one, as data for the GPU is buffered on the host side. A good balance usually exists when the system memory is 2x to 4x the total GPU memory (in practical terms, i.e. available DRAM densities, it is easier to achieve 4x on smaller workstation-class systems and 2x on larger server-like systems). Speed-wise, you want at least DDR4-2666, and as many channels of that as you can afford. I would suggest >= four channels. There is no need for DRAM running at insanely high speeds with heat sinks and LEDs. Whether to use ECC or not is a matter of personal preference. I always use it in my personal machines, but based on my experience running 24/7/365 you are not likely to encounter more than one soft error per year per 64 GB of memory, unless perhaps you operate at very high altitude or in an environment with (naturally) elevated levels of radioactivity.

  • Insufficient power supply. Electronic components are typically listed with their TDP (thermal design power) or something essentially equivalent as nominal power. This is the power consumption averaged over longer periods of time, e.g. 5 minutes, and is needed by system integrators (which might be you, if you assemble the system yourself) to appropriately size cooling solutions. Modern high-end CPUs and GPUs have dynamic power and clock management, which can lead to sudden spikes in instantaneous power, e.g. across 10 milliseconds. These spikes can exceed TDP by significant amounts. If the power supply unit (PSU) cannot keep up, localized voltage drops (“brown outs”) can occur which slow down the switching speed of transistors, which can lead to system component malfunction, in particular the dreaded “GPU fell off the bus” error.

    For a rock-solid system across a wide variety of system loads and environmental conditions (in particular, ambient temperature), my standing recommendation is to have the sum of the nominal power draw of all system components not significantly exceed 60% of the nominal PSU power (see the sketch after this list for a worked example). Some people object to this rule of thumb as overly conservative, so if you are more adventurous, you might instead shoot for 70%. For a high-end workstation, I would suggest an 80PLUS Platinum rated PSU, and an 80PLUS Titanium rated one for a large server. These 80PLUS classes have high efficiency, so more of the juice that comes out of the wall plug (which may well be limited by circuit breaker amperage in the US: 15A at 120V max for most residential circuits) is available to the system. Plus you save on electricity costs. Component and build quality is usually superior as well, with corresponding long warranty periods. PSUs and DRAM (in that order) are the system components most likely to physically fail over time in my experience.

  • Cooling can be tricky with multiple high-end GPUs placed in close proximity. You are essentially building half of a space heater. Device temperature is one input to the dynamic clock management of the GPU: the hotter a GPU is running, the slower its clocks will be set. Eventually you might run into thermal throttling. I can see my GPUs running at anywhere from about 1300 MHz to 1700 MHz for the same workload depending on ambient temperature (50 deg F to 85 deg F), so the resulting performance difference can be non-trivial. So make sure the GPUs will run as cool as possible. I am not an expert on the details of cooling solutions, maybe someone else can chime in with relevant advice.
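
To make the rules of thumb above concrete, here is a minimal sanity-check sketch in Python. All component names and numbers below are made-up placeholder values for illustration, not a recommendation; plug in the nominal (TDP) figures from your actual parts list.

# Rough sanity check for the rules of thumb above (illustrative numbers only).

gpu_count        = 4
gpu_tdp_watts    = 350     # e.g. a high-end RTX 30xx class card (placeholder)
gpu_mem_gb       = 24      # per GPU
cpu_tdp_watts    = 105
cpu_cores        = 16
system_mem_gb    = 256
other_watts      = 75      # drives, fans, motherboard, peripherals (guess)
psu_rating_watts = 2000

# CPU: no more than ~4 cores per high-end GPU are needed; clock frequency matters more.
print("cores per GPU:", cpu_cores / gpu_count)   # aim for <= 4

# System memory: 2x to 4x the total GPU memory.
ratio = system_mem_gb / (gpu_count * gpu_mem_gb)
print("system/GPU memory ratio: %.1fx (target 2x to 4x)" % ratio)

# PSU: the sum of nominal component power should not significantly exceed
# 60% (conservative) to 70% (adventurous) of the nominal PSU rating.
total_watts = gpu_count * gpu_tdp_watts + cpu_tdp_watts + other_watts
print("nominal load: %d W = %.0f%% of PSU rating (target <= 60-70%%)"
      % (total_watts, 100.0 * total_watts / psu_rating_watts))

With these placeholder numbers the load comes out at roughly 79% of the PSU rating, i.e. already over budget by my rule of thumb, which illustrates how quickly several high-end cards outgrow even a large PSU.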

Thanks Rs277 and Njuffa for your thoughts!

I’ve seen some articles by Puget, which to some extent inspired my post. I agree: they’re quite educational.

Returning to my original point, I’m not sure whether this forum is the proper place for discussing performant multi-GPU rigs. Let’s see if there’s much interest, and then I’ll be happy to put in my thoughts to continue the conversation.

Best I can tell, Puget Systems did not use compute workloads for their quad RTX 3090 system, so I do not think the power consumption they observed adequately reflects what one would see with various deep-learning codes, for example.

While a standard residential circuit in the US (15A, 120V) can theoretically supply 1800W, a 15A circuit breaker will likely trip after a few minutes of applying that maximum load continuously. In addition, if I understand US electrical code correctly, a single plug-connected device shall not pull more than 80% of the maximum current. That would be 12A at 120V, i.e. 1440W. Based on that, running a 1720W load from a standard electrical outlet would not look like a good idea. In addition, even a 1600W 80PLUS Titanium rated PSU would be right at the specification limit of operation under those circumstances.

It is one thing to find that something happens to work for a limited amount of time using a few selected workloads with brand-new hardware. It is a different thing to achieve long-term reliable operation across a wider universe of computational loads, under varying environmental conditions, for aging hardware.

BTW, I forgot to mention that when projecting the nominal power budget for the system, a good rule of thumb for DDR4 memory is to assume 0.4W per GB.
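
As a quick illustration of that rule (the capacity is just an example number):

# DDR4 power rule of thumb: ~0.4 W per GB of installed memory.
system_mem_gb = 256                                       # example capacity
print("DDR4 budget: ~%.0f W" % (0.4 * system_mem_gb))     # ~102 W to add to the component sum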

Thanks. Fixed. :-)

Under the “HPC Blog” link above, TensorFlow was used to good effect on a standard 15A circuit:

Overall, though, he does not recommend it for long-term use.

For an experiment, this is fine. But in real life I would not want to shell out really big bucks for a beefy top-of-the-line GPU, only to then put it on a 280W “starvation diet” so as not to trip circuit breakers. Going with a middle-of-the-line model and running that flat out (use nvidia-smi to set the power limit to the maximum allowed and cool aggressively) would appear to be the more sensible option in terms of bang for the buck.
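
If anyone prefers to do that programmatically rather than from the command line, here is a minimal sketch using the NVML Python bindings (pynvml), assuming a single GPU at index 0 and sufficient privileges; nvidia-smi -pl <watts> achieves the same thing interactively.

# Query the board's power limit range and raise the limit to the maximum allowed.
# Setting the limit typically requires root/administrator privileges.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML reports power values in milliwatts.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print("power limit range: %.0f W to %.0f W" % (min_mw / 1000.0, max_mw / 1000.0))

pynvml.nvmlDeviceSetPowerManagementLimit(handle, max_mw)
print("current draw: %.0f W" % (pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0))

pynvml.nvmlShutdown()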

Mostly I am just writing about power considerations to spare people the scenario that has been related in these forums too often now: “30 minutes into my training run, my GPU ‘fell off the bus’. What could be the reason?”

Fully understand the sentiment. The last graph in the article is interesting though, inasmuch as watts per result show rapidly diminishing returns - the last 2% of performance costs 200W across the four cards.

I don’t doubt the numbers, even though the example seems extreme. Generally this holds true because it is driven by basic physics.

For a CMOS circuit, dynamic power grows linearly with operating frequency but with the square of the supply voltage. The GPU will try to boost clocks, i.e. increase the operating frequency, and in order to do that it needs to increase the voltage as well to guarantee reliable operation. For my Quadro RTX 4000, for example, I observe that the voltage increases roughly from 0.737V at 1395 MHz to 1.012V at 1890 MHz, i.e. almost exactly the expected linear relationship.

As the clock increases by a factor of x, dynamic power consumption is therefore expected to increase by a factor of x³. For a particular workload this GPU draws 84W at 1395 MHz, and since the power limit is 125W, it could not run continuously at more than about 1600 MHz with this workload (x = 1.14). In practice it is almost always limited by thermal throttling before that. Note that I am not modelling the power draw of the on-board memory, the PCIe interface, or the cooling fan here, so this is a very rough model.
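
To make the arithmetic explicit, here is the rough model from the preceding paragraphs in a few lines of Python (the numbers are the ones observed above; memory, PCIe and fan power are ignored, as noted):

# Crude dynamic power model: P ~ f * V^2, with V scaling roughly linearly in f,
# so P scales roughly with f^3.
f_low,  v_low  = 1395.0, 0.737   # MHz, volts (observed)
f_high, v_high = 1890.0, 1.012

print("frequency ratio: %.2f, voltage ratio: %.2f"
      % (f_high / f_low, v_high / v_low))        # ~1.35 vs ~1.37, roughly linear

p_low     = 84.0    # watts drawn at 1395 MHz for this workload
power_lim = 125.0   # board power limit in watts

# Solve p_low * x^3 = power_lim for the sustainable clock ratio x.
x = (power_lim / p_low) ** (1.0 / 3.0)
print("max sustainable clock: ~%.0f MHz (x = %.2f)" % (f_low * x, x))   # ~1590 MHz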

Speaking of power draw, do we know if there’s a difference between the power needs of tensor-core calculations and those of CUDA code that doesn’t utilize tensor cores? When running my CUDA code, which doesn’t utilize tensor cores and is light on GPU memory throughput, the fully loaded GPU consumes only about 80% of its power limit, and about a third of that limit on older GPUs. This is in contrast to the power-hungry TensorFlow experiments published by Puget and referenced above, in which I presume they’ve been utilizing tensor cores heavily.

In my somewhat dated experience (I have not run relevant experiments in a number of years), GPU power draw reaches a maximum for a particular mix of computational core and GPU memory activity, and the nominal (TDP) limit cannot be reached with core activity alone. This is not just due to the power draw of the memory chips themselves, but also due to the switching activity of the beefy transistors needed in the memory interface of the GPU, and the activity of memory-related on-chip structures such as caches and TLBs. I seem to recall that maximum power draw required GPU memory activity of around 1/3 of maximum throughput, but my memory is very hazy.

It should be noted that not every FADD, FMUL, or FMA operation requires the same amount of energy; the data being processed also makes a difference, due to varying numbers of transistors switching. I don’t have any insights into the internal structure of the CUDA cores, though, so for GPU power-maximizer programs I have not been able to do better than using randomly generated operands.
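
For what it is worth, a crude way to approximate such a load generator without writing a custom kernel is to hammer the GPU with large matrix products on random operands. Here is a sketch using CuPy (the matrix size is arbitrary, and this is not a calibrated power virus):

# Crude GPU load generator: repeated large FP32 matrix products on random data.
import cupy as cp

n = 8192                                      # adjust to fit your GPU memory
a = cp.random.rand(n, n).astype(cp.float32)   # randomly generated operands
b = cp.random.rand(n, n).astype(cp.float32)

for _ in range(1000):                         # run long enough to reach thermal steady state
    a = cp.matmul(a, b)                       # keeps the FMA units and memory interface busy
    a /= cp.abs(a).max() + 1.0                # rescale to keep values from overflowing
cp.cuda.Device(0).synchronize()               # wait for the GPU to finish

Reading the power draw with nvidia-smi or NVML while something like this runs shows how far a given code is from the board's power limit.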