Hello everyone, I am looking to purchase a GPU workstation for high-performance computing (primarily FP32, with FP64 as a secondary consideration) for CFD applications. I am currently using the GV100, which has been performing very well. I want to buy a device with a newer architecture but am unsure which GPUs are best suited for HPC. Why can gaming cards like the 4090 achieve higher single-precision floating-point throughput than some professional cards? I would appreciate any advice and insights on which parameters to consider when making a purchase.
The gaming cards often run at higher temperatures and higher clock frequencies. They are not meant for prolonged (24/7), reliable operation (no ECC). On the other hand, they can be a good fit for developer workstations.
There are three computation speed parameters:
- FP32 performance (non-tensor core)
- Tensor Core performance with lower bit count (e.g. FP16 with FP16 accumulation)
- Tensor Core performance with higher bit count (e.g. TF32 or FP16 with FP32 accumulation)
- (FP64: your GV100 is better)
They are modified by SM count and (boost) clock frequency.
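As a rough sketch of how SM count and boost clock set the peak numbers (the per-SM core counts and clocks below are the commonly published figures; treat them as approximate):

```python
def fp32_tflops(sms: int, fp32_cores_per_sm: int, boost_ghz: float) -> float:
    """Peak FP32 = SMs x FP32 cores per SM x 2 ops per FMA x clock."""
    return sms * fp32_cores_per_sm * 2 * boost_ghz / 1e3

# RTX 4090 (AD102): 128 SMs, 128 FP32 cores/SM, ~2.52 GHz boost
print(round(fp32_tflops(128, 128, 2.52), 1))   # ~82.6 TFLOP/s
# RTX A5000 (GA102): 64 SMs, 128 FP32 cores/SM, ~1.695 GHz boost
print(round(fp32_tflops(64, 128, 1.695), 1))   # ~27.8 TFLOP/s
```

Controlling for SM count and clock this way is also how the architectural comparisons below were made.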
The memory has
- size
- bandwidth
- L2 size
For high-performance computing since Volta, the Tensor Cores are even more important. Depending on the architecture generation, the professional cards have the same, 2x, or even 4x the Tensor Core performance of the consumer cards (sometimes only for Tensor Core instructions with FP32 accumulation). FP64 is quite slow on current consumer cards (and even on most of the professional cards; exceptions are the A30, A100, and H100). You definitely should keep the GV100 for FP64; it also has large memory and good memory bandwidth.
So e.g. compare (similar price):
- RTX 4090
- RTX A5000
- RTX 4500 Ada
Have a look here for Ampere vs. Ada performance:
You can see there that the consumer Ada vs. consumer Ampere performance only increased because of the higher number of SMs and the higher boost clock; no architectural increase can be seen for the non-Tensor Core or the Tensor Core performance (Appendices A and B).
And the Tensor Core performance/SM/clock even halved for professional vs. professional cards (Appendix C); compare with the non-Tensor Core performance to see the difference (controlling for SM count and boost frequency).
So depending on use, I would either go with the A5000 or the 4090.
- The A5000 has the professional features (e.g. setting clock speeds, ECC, higher reliability, up to 4x Tensor Core speed-up/SM/clock, less power consumption/heat generation),
- the 4090 has double the number of SMs overall and 1.5x the clock speed, slightly more memory bandwidth, and a few more features (8-bit floating-point Tensor Core computations).
So (also considering SM count and clock frequency), for the higher-bit Tensor Core computations (TF32 or FP16 with FP32 accumulation) the A5000 is slightly faster; for the lower-bit Tensor Core computations, the 4090 is slightly faster. The 4090 definitely has more non-Tensor Core performance and slightly more memory bandwidth. All in all, they are very similar cards performance-wise and have the same memory size of 24 GB. One other parameter: the 4090 has 72 MB of L2 cache vs. the 4 MB of the A5000. That could also make a difference for a lot of algorithms.
If you have a budget for 3x the price, you can also consider the L40, RTX 5880 Ada and RTX 6000 Ada, which all are comparable to the 4090 for base performance and definitely faster for Tensor Core computations and have more memory (48GB memory and 96 MB of L2).
Hello, I’d like to ask for your advice on a few more questions. My budget is almost finalized, and I’ll likely choose between the 4090, RTX A6000 and RTX 5000 Ada for building a dual-GPU workstation. In your opinion, which one would be better for HPC use? Also, for such a dual-GPU professional workstation, should I opt for a high-end CPU, like Intel’s Xeon series?
What would be the usage of your dual GPUs? Running the same kernels with different data, or running different kernels? Implementing multi-GPU algorithms per se (e.g. for later porting to an even higher-scaled system)? Do you want to use the additional memory or the additional computational performance of a second GPU?
About the main system: are performance-critical parts of the algorithms run on the CPU? Do you need special peripherals (e.g. for communication) or a large amount of memory?
If the workstation is used for programming and debugging, you perhaps want to run C++/Python/Matlab code on CPUs for analyzing data or running a ground truth/gold standard algorithm for comparison.
For Cuda specifically, you would want enough PCIe lanes at the highest speed supported by the GPUs (e.g. PCIe 4.0x16) and a very high CPU clock frequency to launch the kernels with low latency.
For the other mentioned topics, a high memory bandwidth (many RAM channels).
One CPU is often simpler (affinity/NUMA) than two.
If you have the budget, there are mainboards with octa channel DDR5 and PCIe 5.0 for
- Intel CPUs (C741 or W790 chipsets with 4677 socket) or
- AMD CPUs (WRX90 or Epyc 9004 SoC for sTR5 or SP5 sockets).
But for GPU performance alone you would not need it.
Would you get two of those GPUs or is your second GPU the GV100?
Hello Curefab,
Thank you so much for taking the time to answer my questions. The dual GPUs are meant to run the same kernel because our computational fluid dynamics (CFD) tasks require a lot of memory, and the memory of a single GPU is far from enough. I’m planning to develop a multi-GPU algorithm, which is why I want to set up this dual-GPU workstation.
In my plan, only lightweight computations will be handled by the CPU, while the rest will be fully managed by the GPUs. However, the vendor mentioned that without a powerful CPU, the GPUs cannot run stably. I’m not sure how accurate this claim is.
I’m planning to buy two new GPUs, unrelated to the GV100. I’m choosing between the 4090, RTX A6000, and RTX 5000 ada.
For a multi-GPU setup I would recommend GPUs supporting NVLink. The Ada GPUs do not support NVLink (neither the 4090 nor the workstation variants) and would need PCIe to exchange data. The Ada GPUs support only PCIe 4.0 [correction: I originally wrote that the 4090 supports PCIe 5.0, but it is also limited to 4.0], which makes them slower for a multi-GPU setup without NVLink support.
You also wrote that large memory size is important.
All options in this price range with higher memory and NVLink support:
- 2xRTX 8000: 2x48 GB, NVLink 50 GB/s per direction
- 2xA40: 2x48 GB: NVLink 56 GB/s per direction
- 2xA6000: 2x48 GB, NVLink 56 GB/s per direction
As the Ampere cards have double the speed (Tensor Core and non-Tensor Core), I would take either the A40 or the A6000. They are quite similar.
Or the RTX 4090 using PCIe for communication between the GPUs
- 4x4090: 4x24 GB, PCIe 4.0x16 31.5 GB/s per direction [correction: I originally wrote PCIe 5.0x16 63 GB/s] (less memory per GPU; consumer instead of professional, but large L2 cache)
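For reference, the per-direction PCIe figures follow from lane count, transfer rate, and the 128b/130b line encoding (a small sketch of the peak numbers; real-world throughput is a bit lower due to protocol overhead):

```python
def pcie_gb_per_s(lanes: int, gt_per_s: float) -> float:
    """Per-direction bandwidth: lanes x GT/s x 128/130 encoding / 8 bits per byte."""
    return lanes * gt_per_s * 128 / 130 / 8

print(round(pcie_gb_per_s(16, 16), 1))  # PCIe 4.0 x16 -> ~31.5 GB/s
print(round(pcie_gb_per_s(16, 8), 1))   # PCIe 3.0 x16 -> ~15.8 GB/s
```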
I would choose a workstation with ample RAM and enough RAM bandwidth to feed the PCIe links to and from the GPUs, but a more modest CPU speed. The vendor's argument about stability makes no sense.
You probably would use as mainboard one of
PCIe 5.0 & DDR5
- Intel 4677 socket (C741 or W790 chipset): 4xDDR5(/8xDDR5)
- (AMD SP5 socket (Epyc 9004 SoC): 12xDDR5 → CPUs expensive)
- AMD SP6 socket (Epyc 8004 SoC): 6xDDR5 (only up to 3xPCIe 5.0x16)
- (AMD sTR5 socket (TRX50 chipset): 4xDDR5 → CPUs expensive)
PCIe 4.0 & DDR4
- Intel 4189 socket (C621A chipset): 8xDDR4
- AMD SP3 socket (Epyc 7001/2/3 SoC): 8xDDR4
- AMD sWRX8 socket (WRX80 chipset): 8xDDR4
- (AMD sTRX4 socket (TRX40 chipset): 4xDDR4 → CPUs expensive)
Affordable CPUs:
- 4677: some CPUs (Raptor Cove or Golden Cove)
- SP6: AMD Epyc 8024P (Zen 4c)
- 4189-4: Intel Xeon Silver 4310 (Sunny Cove)
- SP3: many CPUs (Zen 2 or Zen 3)
- sWRX8: AMD Ryzen Threadripper PRO 3945WX (Zen 2)
A very modern, powerful system could be based on the ASRock Rack SPC741D8 mainboard with 4x PCIe 5.0x16 slots and quad-channel DDR5 memory.
Instead, with the PCIe 4.0 octa-channel DDR4 boards (especially the AMD ones) you can still create a very good system with many (e.g. 5x) PCIe 4.0x16 slots at a slightly more affordable price. Better to invest the difference into GPUs.
4xDDR5 or 8xDDR4 have similar speeds in the end.
All of the mentioned GPUs are PCIe 4.0, so they can use either board variant. [Correction: I originally wrote that the 4090 should get the PCIe 5.0 variant, but it is also only PCIe 4.0.]
PS
You could also consider multiple A30 GPUs:
- They have the GA100 chip, providing NVLink with 300 GB/s per direction for really fast multi-GPU setups
- They only have 24GB memory per GPU
- The non-Tensor Core FP32 of the GA100 is in theory half that of GA102-based GPUs; in practice the numbers are closer together (the GA100 has dedicated FP32 and dedicated INT32 cores; the GA102 has mixed FP32/INT32 cores plus dedicated FP32 cores, so its theoretical FP32 number is doubled)
- Otherwise it is a very fast card including serious FP64 performance on non-tensor core and on tensor cores
- simpler to connect for two GPUs (NVIDIA A100 NVLink 2-Slot Bridge Bundle | A15976), more complicated (large rack switch needed) for more than 2 GPUs
If the other GPUs are connected via NVLink using PCIe 3.0, would their performance be unaffected, or would it still be necessary to use at least PCIe 4.0?
If the GPUs are connected via NVLink, they can communicate in a fast way: accessing each other's memory not only by copy, or by automatically moving memory with managed memory, but also by true peer-to-peer (P2P) access. Accessing data from other cards over NVLink is still slower (less bandwidth and probably higher latency) than "local" access from within one GPU.
The PCIe speed and main system RAM bandwidth would only be needed then to transfer data between CPU and GPU. For very large models you perhaps need to keep all data on the main system RAM.
But if, for example, 96 GB (=2x 48 GB) is enough, then the GPUs can communicate with each other over NVLink and PCIe bandwidth would not matter, except for the initial data transfer and for the results.
NVLink also has version numbers (see e.g. NVLink - Wikipedia), but they are not directly related to the PCIe version numbers. Also, the number of lanes per sub-link and the number of sub-links vary between devices, whereas PCIe typically goes up to 16 lanes.
According to the white paper, 4090 is only PCIe 4.0.
Also of note, P2P is not officially supported between 4090’s, unlike the workstation cards - RTX 5000 ada etc.
There is an open source driver for the 4090, that adds P2P to the 4090.
I’m possibly misunderstanding something here, but using the RTX A6000 as an example, theoretical bidirectional PCIe 4.0 bandwidth is around 64GB/s and NVLink is 112GB/s.
Not sure why I got this wrong. Yes, the 4090 is indeed only connected with PCIe 4.0, too (as are the other Ada Lovelace cards).
So there are not many reasons left (price, L2 cache size, clock frequency, SM count) to prefer it over the Ampere cards (which offer additional NVLink, the same non-Tensor Core and Tensor Core performance, and the same PCIe 4.0).
The memory size and bandwidth quickly eats up any computational advantage or limits the model resolution.
Other cards (not recommended):
- The A16 with 64 GB is like 4 graphics cards with 16 GB each. AFAIK it has no NVLink; the 4 dies communicate with each other over PCIe.
So the 48 GB cards are the best you can get (below A100/H100 prices, the version with 80GB).
(Theoretically you could put together 3 SXM2 (not PCIe) Quadro GV100 cards with 32 GB each, which support NVLink, with one NVSwitch for Volta. This combination would also provide 96 GB of GPU memory, but with a P2P access speed of 150 GB/s per direction. That would be a small DGX-2.
Also, generally with NVSwitches, one could look for the most affordable Nvidia card still supporting NVLink and distribute the memory across more cards; then the bandwidth needed per card gets lower.)
Actually it is quite sad that the newest Nvidia GPUs to recommend for a dual GPU-setup (RTX A6000 and A40) were launched a bit more than 4 years ago now.
Thank you for checking. I have always stated the bandwidth per direction above, which is 31.5 GB/s for PCIe 4.0 and 56 GB/s for NVLink.
Yes, I was misinterpreting "it is still with less bandwidth than local access" as meaning PCIe access rather than intra-GPU access…
Tried to make that sentence clearer.
How about the following build?
Mainboard and CPU with 7xPCIe 4.0x16, e.g.
- ASRock Rack WRX80D8-2T with AMD Ryzen Threadripper PRO 3945WX, or
- ASRock Rack ROMED8-2T or -2T/BCM with a CPU of the Epyc 7002 or 7003 series
7xRTX 4000 Ada Generation 20GB
Each one by itself would be weak, but in sum they have 140 GB, 336 MB L2, 2520 GB/s memory bandwidth, 187.11 single-precision TFLOP/s, and the PCIe 4.0x16 bandwidth would also sum up to 220.5 GB/s per direction (as for each access one GPU has to receive and one GPU has to transmit).
Compared to 2xRTX A6000: 96 GB, 12 MB L2, 1536 GB/s memory bandwidth, 77.42 Single Precision TFLOP/s, 112 GB/s per direction NVLink (2 GPUs x 56 GB/s per direction; but less needed as more data is locally on each GPU).
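The aggregate numbers above are just the per-card spec-sheet values multiplied up; a quick check (per-card values taken from the published specs):

```python
def aggregate(n_cards, mem_gb, l2_mb, bw_gb_s, fp32_tflops):
    """Sum up per-card memory, L2, memory bandwidth, and FP32 throughput."""
    return (n_cards * mem_gb, n_cards * l2_mb,
            n_cards * bw_gb_s, round(n_cards * fp32_tflops, 2))

# 7x RTX 4000 Ada: 20 GB, 48 MB L2, 360 GB/s, 26.73 TFLOP/s each
print(aggregate(7, 20, 48, 360, 26.73))   # (140, 336, 2520, 187.11)
# 2x RTX A6000: 48 GB, 6 MB L2, 768 GB/s, 38.71 TFLOP/s each
print(aggregate(2, 48, 6, 768, 38.71))    # (96, 12, 1536, 77.42)
```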
The RTX 4000 Ada Generation is a single-slot card, so they would fit beside each other.
Alternative: one more powerful card for single-GPU use plus 6xRTX 4000 Ada for multi-GPU, high-memory work.
Thank you very much for your help, Curefab. The initial configuration the vendor offered me was a Dell Precision T7920 with an Intel Xeon Gold 6154. After hearing your suggestions, I’ve decided to negotiate with the vendor. I will also seriously consider your advice about six RTX 4000 Ada cards; you’ve opened up new possibilities for me, as I was limited in my thinking to a dual-card setup.
You are welcome. It is always interesting for me to update my previous experience with current offerings, also for choosing our next setups.
If I may, just a quick feedback on the CPU and the dual vs. multi-card setup:
Normally it can be a good thing to get a vendor-certified system, to have a reliable setup. On the other hand, it should fit your specific needs.
The Xeon Gold 6154 is an 18-core CPU from 2017.
It has 48 PCIe 3.0 lanes by itself, so
- would slow down Ampere and Ada GPUs (which are PCIe 4.0) and
- only be able to handle 2 GPUs natively from the CPU (as some lanes are used for M.2 or other interfaces). For more GPUs, the mainboard would have to provide PCIe switching circuits, which could further lower PCIe peer-to-peer bandwidth if some GPUs are connected to different PCIe switches and bridges. With a multi-GPU setup you want full PCIe switching speed between all GPUs at the same time. Depending on the PCIe architecture of the mainboard, more than two PCIe cards can be okay, but you have to look specifically into how the lanes are routed and switched.
- 3647-socket mainboards offering more than two PCIe 3.0x16 slots are typically dual-CPU mainboards. You probably want to avoid those if you use PCIe extensively (Ada generation, or Turing/Ampere without NVLink, or more than two cards), as you could see slowdowns between the PCIe lanes of either half of the system.
So either for
- a Turing-generation setup in general (for more than two cards you have to watch out for the PCIe hierarchy) or
- an Ampere-dual GPU setup with NVLink (not caring for PCIe speed)
the CPU could be okay; however, for just being okay, the CPU is typically a bit too expensive IMHO.
Preferring dual-GPU vs. multi-GPU setup depends on your CFD algorithms:
Are they computation- or latency-bound? How much data has to be exchanged between the GPUs? How often? How globally or how locally? Probably quite locally for a finite-element setup.
E.g. it is possible to either
- have a hard boundary between GPUs and store and calculate each point on one GPU or the other
- have overlapping boundaries: compute the boundary region on both GPUs for some iterations at a time (until data from further away is needed and has to be synchronized)
The second approach has slightly higher computation and memory demands, but allows several iterations to be computed locally before synchronizing or moving data P2P.
With a dual-GPU setup you have more data locally on the same GPU; with a multi-GPU setup you have more boundaries. But if the algorithms have to handle boundaries anyway (even on a single GPU, the different SMs each have their own L1 cache and shared memory), then the additional memory and computational performance of a multi-card setup could be an advantage.
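As a toy estimate of the boundary traffic (assuming a 1-D slab decomposition of a 3-D grid with a stencil of radius 1; the grid size and function name are made up for illustration):

```python
def halo_mb_per_sync(nx, ny, ngpus, halo_width, bytes_per_cell):
    """1-D slab decomposition: each of the (ngpus - 1) internal boundaries
    exchanges two halo slabs of nx * ny * halo_width cells per sync."""
    slab_bytes = nx * ny * halo_width * bytes_per_cell
    return (ngpus - 1) * 2 * slab_bytes / 1e6

# Hypothetical 512^3 FP32 grid
print(round(halo_mb_per_sync(512, 512, 2, 1, 4), 1))  # ~2.1 MB per sync
print(round(halo_mb_per_sync(512, 512, 8, 1, 4), 1))  # ~14.7 MB per sync
```

With wider overlapping halos (halo_width > 1), each sync moves more data, but syncs are needed only every halo_width iterations, which is the trade-off described above.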
It would be interesting to know which system you decide on in the end.
Hello, Curefab. The final setup hasn’t been decided yet, as I’ve provided two proposals to the vendor for pricing, and the final decision will depend on my leader’s approval. I will definitely inform you once the final plan is decided.
My first proposal is two RTX A6000 GPUs connected via NVLink.
The second proposal is eight RTX 4000 Ada GPUs connected via PCIe 4.0.
Since our algorithms may involve GPU-CPU interaction, I specifically asked the vendor to ensure the communication protocol supports at least PCIe 4.0, regardless of whether NVLink is used.
Regarding CFD, I fully agree that working with eight GPUs will certainly be more challenging than with two, as there will be more boundary information to manage. However, I believe this is a necessary step. The purpose of this workstation is primarily to develop and debug algorithms. For simulating large-scale numerical problems, these devices would be far from sufficient, and we would need a server with eight A100 GPUs to handle those tasks. That would be something to consider after the algorithm is fully developed, potentially by renting server time!
There was also an unexpected twist: the vendor claimed that the RTX A6000 is not suitable for scientific computing. Now I seriously doubt their expertise!
For scaling the algorithm up to eight A100 GPUs, the dual- or multi-card setup is a good base for development and debugging.
The one feature the RTX A6000 (and many other workstation GPUs) is missing is good FP64 performance.
There are approaches (also specifically with Cuda) to store two FP32 values instead and use clever FP32 computations to simulate FP64 accuracy.
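One building block of such float-float ("double-single") schemes is the error-free two-sum transformation; a minimal sketch (demonstrated here with Python's float64 for convenience; the same trick applied to pairs of FP32 values on the GPU recovers roughly FP64-like accuracy):

```python
def two_sum(a: float, b: float):
    """Knuth's two-sum: returns (s, e) with s + e == a + b exactly,
    where s is the rounded sum and e is the rounding error."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

s, e = two_sum(1.0, 1e-17)
print(s)   # 1.0   (the small addend is lost in the rounded sum ...)
print(e)   # 1e-17 (... but recovered exactly in the error term)
```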
Also the RTX A6000 has some FP64 performance. It may not be great, but it can be used.
Otherwise I would not know any reason why the RTX A6000 should not be suitable for scientific computing.
It is a workstation GPU meant for continuous usage, it has 48 GB of ECC memory.
With 8 GPUs on PCIe 4.0, they would load data at around 252 GB/s max in total.
A DDR4-2666 octa-channel system has around 170 GB/s max memory bandwidth.
So you should either accept a bit of slowdown for GPU-CPU memory transfers or keep an eye on the available CPU memory bandwidth, too (mostly limited by the CPU).
E.g. the Epyc 9124 CPU has 461 GB/s of memory bandwidth.
Just as a warning: it will be a bit difficult to find a single-CPU mainboard with 16 PCIe lanes for each of 8 GPUs (more likely 5, 6, or 7 GPUs, or not all GPUs get 16 lanes); 8 GPUs would probably be easier to partition the work across than 5, 6, or 7, but let's see what the vendor offers.
The vendor has provided the following configurations:
Option 1:
- Dell Precision 7960 Tower Workstation
- CPU: Intel® Xeon® W5-3425 processor (12 cores, 24 threads, 3.2-4.6GHz)
- GPU: NVIDIA RTX A6000-48G (NVLink bridge)
Option 2:
- Dell Precision 5860 Tower Workstation
- CPU: Intel® Xeon® W3-2435 processor (8 cores, 16 threads, 3.4-4.5GHz)
- GPU: NVIDIA RTX A6000-48G (NVLink bridge)
Option 3:
- Dell Precision 7960 Tower Workstation
- CPU: Intel® Xeon® W5-3425 processor (12 cores, 24 threads, 3.2-4.6GHz)
- GPU: NVIDIA RTX 4000 Ada -20G (4 cards)
Unfortunately, I cannot share the total price. Based on the exchange rate, Option 1 is about $1,400 more expensive than Option 2, and Option 2 is about $5,600 more expensive than Option 3.
The vendor mentioned that they can only install 4 RTX 4000 Ada cards. I think this works as well, as it gives us a range of pricing options to choose from.