GPU for physics simulations

Our research group is shifting to GPUs for various physics simulations, mostly written in C++. Most of our code utilises 64-bit floating point (FP64), and we parallelise it with OpenACC. During initial testing, we found that our codes run faster on an A100, which has 6912 CUDA cores, than on an RTX 3090, which has 10496 CUDA cores. After some digging, it seems the “speed” of execution for this kind of work depends on FP64 FLOPS, which is much higher on the A100 even though it has fewer CUDA cores. Is this all there is to it? That is, when buying further GPUs for this specific purpose of physics simulations, should we keep an eye out only for the FP64 FLOPS value, or should something else be considered as well, for example memory type (HBM2 vs GDDR6) or the number of CUDA cores?

A100 FP64: 9.75 TFLOPS
RTX 3090 FP64: 0.56 TFLOPS

All consumer cards are purposefully limited in their FP64 throughput, but may have other features not present in the professional HPC line, such as ray-tracing units.

For HPC computations it is frequently the case that memory bandwidth is a more significant limiter than FP64 throughput. You would have to do a roofline analysis of your codes to see which factor limits performance first. For example, FFTs are typically limited by memory throughput, while dense matrix-matrix multiplies are typically limited by compute unit throughput.
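
If it helps, here is a minimal sketch of the roofline arithmetic in C++. The peak FLOPS and bandwidth figures are assumptions for an A100 (substitute your own hardware's numbers), and daxpy is just an illustrative kernel:

```cpp
#include <cstdio>

// Minimal roofline check: a kernel is memory-bound if its arithmetic
// intensity (FLOPs per byte moved to/from DRAM) is below the machine
// balance (peak FLOPS / peak bandwidth), compute-bound otherwise.
int main() {
    // Assumed peaks for an A100 (substitute your own GPU's figures):
    const double peak_fp64_flops = 9.7e12;   // FLOP/s
    const double peak_bandwidth  = 1.555e12; // bytes/s
    const double machine_balance = peak_fp64_flops / peak_bandwidth; // ~6.2 FLOP/byte

    // Example kernel: daxpy, y[i] = a*x[i] + y[i].
    // 2 FLOPs per element, 3 * 8 bytes moved (read x, read y, write y).
    const double intensity = 2.0 / 24.0; // ~0.083 FLOP/byte

    if (intensity < machine_balance)
        printf("memory-bound: expect ~%.1f GFLOP/s\n",
               intensity * peak_bandwidth * 1e-9);
    else
        printf("compute-bound: expect ~%.1f GFLOP/s\n",
               peak_fp64_flops * 1e-9);
    return 0;
}
```

For daxpy this predicts roughly 130 GFLOP/s, more than an order of magnitude below the FP64 peak, which is the sense in which such kernels are memory-bound.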

Some important real-life applications require significant amounts of data to be resident in GPU memory to run fast, so GPU memory capacity may be something to consider. Compared to the system memory bandwidth of an HPC host (around 200 GB/sec) and the memory bandwidth of an HPC GPU (1-2 TB/sec), the PCIe interconnect between them has significantly lower throughput (e.g. PCIe 4.0 x16 delivers about 25 GB/sec per direction; it is full duplex).
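
As a back-of-the-envelope illustration using the rough figures above: moving a 10 GB working set across PCIe 4.0 x16 at ~25 GB/sec takes about 0.4 seconds, while streaming the same 10 GB out of GPU memory at ~1.5 TB/sec takes about 7 milliseconds. That is roughly a 60x gap, which is why you want data resident on the GPU rather than shuttled across the bus every step.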

“CUDA cores” is a marketing term: what is being counted there is the number of FP32 FMA units on the chip. Just look at FP32 throughput instead.
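
As a sanity check of how the headline numbers are derived: the RTX 3090 has 10496 of those FP32 FMA units at a boost clock of about 1.7 GHz, and since each FMA counts as 2 FLOPs, that gives 10496 × 2 × 1.7e9 ≈ 35.6 FP32 TFLOPS. Consumer Ampere executes FP64 at 1/64 of the FP32 rate, which is where the 0.56 TFLOPS figure above comes from.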

To first (and possibly even second) order, HBM2 vs GDDR6 should be transparent to CUDA apps in terms of performance. Just look at memory throughput (note that any theoretical memory bandwidth numbers written up in specs need to be de-rated by a factor of 0.85 to get a good idea of practically achievable bandwidth; that is also true for the host system’s system memory).
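
For a concrete example with published specs: the A100 40 GB lists 1555 GB/sec of theoretical bandwidth, which de-rates to roughly 1320 GB/sec practically achievable, and the RTX 3090 lists 936 GB/sec, which de-rates to roughly 800 GB/sec.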

Consider whether your applications require GPUs with ECC memory (this is SECDED capability: single error correct, double error detect).


Our codes will not use ray tracing, so we can count those units out.

Most of the operations in our compute-intensive parallel loops are trig functions, exponentiations and array operations. Our data generally does not go beyond 10-20 GB, so memory capacity is not that big of an issue, though we sometimes use larger models.
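
For reference, a typical inner loop looks something like this (a minimal, made-up sketch of the pattern rather than our actual code; the function and variable names are invented):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative kernel: FP64 trig/exp plus array operations,
// parallelised with OpenACC.
void update(std::vector<double>& out, const std::vector<double>& phase,
            const std::vector<double>& amp, double damping) {
    const std::size_t n = out.size();
    double*       o = out.data();
    const double* p = phase.data();
    const double* a = amp.data();
    #pragma acc parallel loop copyin(p[0:n], a[0:n]) copyout(o[0:n])
    for (std::size_t i = 0; i < n; ++i)
        o[i] = a[i] * std::sin(p[i]) * std::exp(-damping * p[i]);
}
```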

Thanks for pointing out that CUDA cores are essentially FP32 FMA units; I did not know that.

So, if I have to order some GPU hardware, would it be better to get a previous-generation HPC GPU like the V100 (since A100s are quite pricey) rather than current- or previous-generation RTX cards like the 3090 or 4090, given that the FP64 throughput of the V100 (or even the P100) is still far higher than that of these RTX cards?

I have not been involved in HPC purchasing decisions, so I am not in a position to offer much advice. The one factor you should consider is deprecation of software support over time. If I am informed correctly, the recently released CUDA 12 removed support for all of compute capability 3.x (so Tesla K40 and Tesla K80).

I would assume that support for the Pascal architecture (compute capability 6.x) is going to be around for another two years or so, but I have no specific insights into that. HPC GPU purchases should (in my thinking) have a useful lifetime of 4 to 5 years, so if this was me buying hardware, I would certainly not consider GPUs older than V100 at this time. Note that V100 was released 5 years ago, so it is a bit long in the tooth, and you will need to carefully weigh feature lists and performance specs against price.

Thanks, I will keep your suggestions in mind. This was very helpful.
