GPU for physics simulations

Our research group is shifting to GPUs for various physics simulations, mostly written in C++. Most of our code utilises 64-bit floating point (FP64), and we parallelise it with OpenACC. During initial testing, we found that our codes run faster on an A100, which has 6912 CUDA cores, than on an RTX 3090, which has 10496 CUDA cores. After some digging, it seems the “speed” of execution for this kind of work depends on FP64 FLOPS, which is much higher on the A100 even though it has fewer CUDA cores. Is this all there is to it? That is, when buying further GPUs for this specific purpose of physics simulations, should we keep an eye out only for the FP64 FLOPS value, or should something else be considered as well, for example memory type (HBM2 vs GDDR6) or the number of CUDA cores?

A100 FP64: 9.75 TFLOPS
RTX 3090 FP64: 0.56 TFLOPS

All consumer cards are purposefully limited in their FP64 throughput, but may have other features not present in the professional HPC line, such as ray-tracing units.

For HPC computations it is frequently the case that memory bandwidth is a more significant limiter than FP64 throughput. You would have to do a roofline analysis of your codes to see which factor limits performance first. For example, FFTs are typically limited by memory throughput, while dense matrix-matrix multiplies are typically limited by compute unit throughput.
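
If it helps, here is a minimal sketch of the roofline arithmetic in C++. The peak FLOPS and bandwidth figures are assumptions for an A100 (substitute your own hardware's numbers), and daxpy is just an illustrative kernel:

```cpp
#include <cstdio>

// Minimal roofline check: a kernel is memory-bound if its arithmetic
// intensity (FLOPs per byte moved to/from DRAM) is below the machine
// balance (peak FLOPS / peak bandwidth), compute-bound otherwise.
int main() {
    // Assumed peaks for an A100 (substitute your own GPU's figures):
    const double peak_fp64_flops = 9.7e12;   // FLOP/s
    const double peak_bandwidth  = 1.555e12; // bytes/s
    const double machine_balance = peak_fp64_flops / peak_bandwidth; // ~6.2 FLOP/byte

    // Example kernel: daxpy, y[i] = a*x[i] + y[i].
    // 2 FLOPs per element, 3 * 8 bytes moved (read x, read y, write y).
    const double intensity = 2.0 / 24.0; // ~0.083 FLOP/byte

    if (intensity < machine_balance)
        printf("memory-bound: expect ~%.1f GFLOP/s\n",
               intensity * peak_bandwidth * 1e-9);
    else
        printf("compute-bound: expect ~%.1f GFLOP/s\n",
               peak_fp64_flops * 1e-9);
    return 0;
}
```

For daxpy this predicts roughly 130 GFLOP/s, more than an order of magnitude below the FP64 peak, which is the sense in which such kernels are memory-bound.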

Some important real-life applications require significant amounts of data to be resident in GPU memory to run fast, so GPU memory capacity may be something to consider. Compared to the system memory bandwidth of an HPC host (around 200 GB/sec) and the memory bandwidth of an HPC GPU (1-2 TB/sec), the PCIe interconnect between them has significantly lower throughput (e.g. PCIe 4.0 x16 delivers about 25 GB/sec per direction; it is full duplex).
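
As a back-of-the-envelope illustration using the rough figures above: moving a 10 GB working set across PCIe 4.0 x16 at ~25 GB/sec takes about 0.4 seconds, while streaming the same 10 GB out of GPU memory at ~1.5 TB/sec takes about 7 milliseconds. That is roughly a 60x gap, which is why you want data resident on the GPU rather than shuttled across the bus every step.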

“CUDA cores” is a marketing term: what is being counted there is the number of FP32 FMA units on the chip. Just look at FP32 throughput instead.
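
As a sanity check of how the headline numbers are derived: the RTX 3090 has 10496 of those FP32 FMA units at a boost clock of about 1.7 GHz, and since each FMA counts as 2 FLOPs, that gives 10496 × 2 × 1.7e9 ≈ 35.6 FP32 TFLOPS. Consumer Ampere executes FP64 at 1/64 of the FP32 rate, which is where the 0.56 TFLOPS figure above comes from.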

To first (and possibly even second) order, HBM2 vs GDDR6 should be transparent to CUDA apps in terms of performance. Just look at memory throughput (note that any theoretical memory bandwidth numbers written up in specs need to be de-rated by a factor of 0.85 to get a good idea of practically achievable bandwidth; that is also true for the host system’s system memory).
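
For a concrete example with published specs: the A100 40 GB lists 1555 GB/sec of theoretical bandwidth, which de-rates to roughly 1320 GB/sec practically achievable, and the RTX 3090 lists 936 GB/sec, which de-rates to roughly 800 GB/sec.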

Consider whether your applications require GPUs with ECC memory (this is SECDED capability: single error correct, double error detect).


Our codes will not use ray tracing, so we can count those units out.

Most of the operations in our compute-intensive parallel loops are trig functions, exponentiations and array operations. Our data generally does not go beyond 10-20 GB, so memory capacity is not that big of an issue, though we sometimes use larger models.
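
For reference, a typical inner loop looks something like this (a minimal, made-up sketch of the pattern rather than our actual code; the function and variable names are invented):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative kernel: FP64 trig/exp plus array operations,
// parallelised with OpenACC.
void update(std::vector<double>& out, const std::vector<double>& phase,
            const std::vector<double>& amp, double damping) {
    const std::size_t n = out.size();
    double*       o = out.data();
    const double* p = phase.data();
    const double* a = amp.data();
    #pragma acc parallel loop copyin(p[0:n], a[0:n]) copyout(o[0:n])
    for (std::size_t i = 0; i < n; ++i)
        o[i] = a[i] * std::sin(p[i]) * std::exp(-damping * p[i]);
}
```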

Thanks for pointing out that CUDA cores are essentially FP32 FMA units; I did not know that.

So, if I have to order some GPU hardware, would it be better to get a previous-generation HPC GPU like the V100 (since A100s are quite pricey) rather than current- or previous-generation RTX cards like the 3090 or 4090, given that the FP64 throughput of the V100 (or even the P100) is still far higher than that of these RTX cards?

I have not been involved in HPC purchasing decisions, so I am not in a position to offer much advice. The one factor you should consider is deprecation of software support over time. If I am informed correctly, the recently released CUDA 12 removed support for all of compute capability 3.x (so Tesla K40 and Tesla K80).

I would assume that support for the Pascal architecture (compute capability 6.x) is going to be around for another two years or so, but I have no specific insights into that. HPC GPU purchases should (in my thinking) have a useful lifetime of 4 to 5 years, so if this was me buying hardware, I would certainly not consider GPUs older than V100 at this time. Note that V100 was released 5 years ago, so it is a bit long in the tooth, and you will need to carefully weigh feature lists and performance specs against price.

Thanks, I will keep your suggestions in mind. This was very helpful.
