High Compute in Flight, Low DRAM Bandwidth Usage

I have created a numerical solver for large systems of equations which works fairly well, with greater than 95% compute in flight but low memory bandwidth usage of around 2%. I don’t understand how this can be the case; I assumed I would be bandwidth-limited if my cores are working so hard. Is this because I am using double precision, which is emulated with 32 CUDA cores? Am I misinterpreting the results?

You need to give us some more detail - Nsight Compute will highlight bottlenecks.

If you are using any GPU that isn’t a data centre/Tesla-class card, you may well be limited by FP64 performance, as all other cards have very few FP64 cores - see here for throughput performance.

Double-precision arithmetic using the double type is supported directly in hardware on all GPUs that have shipped since 2009 or so. It does not involve emulation. However, on many consumer cards there are relatively few FP64 execution units, resulting in comparatively low throughput.
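If you want to see the effect directly, here is a minimal sketch (my own illustrative code, not your solver; grid size, iteration count, and constants are arbitrary assumptions) that times a dependent FMA chain in both precisions. On a consumer card the FP64 kernel typically runs far slower per operation, even though both precisions execute on dedicated hardware.

```cpp
// Illustrative microbenchmark, not the original poster's solver: time a dependent
// FMA chain in FP32 and FP64. On a 1:64 consumer GPU the FP64 version is expected
// to be far slower, even though both precisions run on dedicated hardware units.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_chain(T *out, T a, T b, int iters)
{
    T x = a;
    for (int i = 0; i < iters; ++i) {
        x = x * b + a;          // compiles to an FMA in either precision
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

template <typename T>
static float time_kernel(T *buf, int blocks, int threads, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_chain<T><<<blocks, threads>>>(buf, (T)1.0, (T)1.000001, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 512, threads = 256, iters = 1 << 16;
    float  *buf32;
    double *buf64;
    cudaMalloc(&buf32, blocks * threads * sizeof(float));
    cudaMalloc(&buf64, blocks * threads * sizeof(double));
    float ms32 = time_kernel(buf32, blocks, threads, iters);
    float ms64 = time_kernel(buf64, blocks, threads, iters);
    printf("FP32: %.2f ms   FP64: %.2f ms   slowdown: %.1fx\n", ms32, ms64, ms64 / ms32);
    cudaFree(buf32);
    cudaFree(buf64);
    return 0;
}
```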

My research showed that consumer GPUs only have FP32 cores, is that wrong? And that they emulate doubles with 32 CUDA cores. I also found that the H100 has a ratio of 1:2, and the RTX A6000 a ratio of 1:64. Is this incorrect information?

Thanks for the link to the chart; that is extremely helpful. The results I mentioned came from Nsight Systems. I’ve used Nsight Compute, and have it open in front of me right now, but I don’t see it specifically telling me what the bottleneck is at a hardware level. It mentions things like “small grid” and “workload imbalance”, but these aren’t hardware attributes, which is what I am investigating at the moment.

This is incorrect information. Where is this emulation story coming from, ChatGPT?

This is correct information.

This is the ratio of FP64 execution units to FP32 execution units. Enterprise-class HPC GPUs have comparatively many FP64 units per FP32 unit (e.g. 1:2), consumer-class GPUs comparatively few (e.g. 1:64). The ratio differs by GPU architecture.
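To translate the ratio into throughput (a rough sketch; SM counts and clocks are per-GPU): peak FLOP/s ≈ number of SMs × execution units per SM × 2 (an FMA counts as two FLOP) × clock frequency, evaluated separately for the FP32 and the FP64 unit counts. The 1:2 or 1:64 figure is simply the quotient of those two per-SM unit counts.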

For what it is worth, a while ago I demonstrated in these forums that on modern consumer GPUs with a 1:64 ratio between FP64 and FP32 one can emulate some FP64 operations with higher throughput than that provided by the native hardware, albeit at increased register pressure.
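For the curious, the trick generally revolves around "pair arithmetic": a value is carried as an unevaluated sum of two floats, and FP32 FMAs recover the rounding errors exactly. The fragment below is only a sketch of that general idea (roughly twice float precision, not a full IEEE-754 double), not the code from my earlier post.

```cpp
// Sketch of float-float ("pair") arithmetic using FP32 FMAs -- an illustration of the
// general idea only. A value is stored as an unevaluated sum hi + lo, giving roughly
// twice the precision of a plain float (not a full double).
#include <cuda_runtime.h>

// Error-free product: p + e equals a*b exactly (relies on the FP32 FMA).
__device__ float2 two_prod(float a, float b)
{
    float p = a * b;
    float e = fmaf(a, b, -p);   // exact rounding error of the product
    return make_float2(p, e);
}

// Error-free sum (Knuth's TwoSum): s + e equals a+b exactly.
__device__ float2 two_sum(float a, float b)
{
    float s = a + b;
    float t = s - a;
    float e = (a - (s - t)) + (b - t);
    return make_float2(s, e);
}

// Product of two float-float pairs (x.x is the high part, x.y the low part).
__device__ float2 ff_mul(float2 x, float2 y)
{
    float2 p = two_prod(x.x, y.x);
    p.y = fmaf(x.x, y.y, fmaf(x.y, y.x, p.y));   // fold in the cross terms
    return two_sum(p.x, p.y);                    // renormalize the pair
}

// Minimal kernel using the pair product, just so the fragment compiles standalone.
__global__ void ff_mul_kernel(const float2 *a, const float2 *b, float2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = ff_mul(a[i], b[i]);
}
```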

Initially, yes. But I followed up and found mentions that CUDA uses an entire warp to emulate a double on consumer GPUs. Can you tell me where I can find accurate technical information on these ratios on a per-GPU basis?

I see from the Arithmetic Instructions table listed above that the performance ratio is dependent on the operation and compute capability, and that for a simple multiplication on my 3070 Ti Mobile (compute capability 8.6), I get two results per clock cycle per SM. Does that mean my card uses a ratio of 1:64? And more importantly, is this a design choice in the physical hardware?

That is correct: the RTX 3070 (Ti) Mobile is based on the Ampere architecture and has a 1:64 FP64-to-FP32 ratio.
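Putting rough numbers on that (the boost clock varies by laptop, so take ~1.5 GHz as an assumption): 5888 CUDA cores / 128 per SM = 46 SMs. FP32 peak ≈ 5888 × 2 FLOP (FMA) × 1.5 GHz ≈ 17.7 TFLOP/s, while FP64 peak ≈ 46 SMs × 2 FP64 units × 2 FLOP × 1.5 GHz ≈ 0.28 TFLOP/s, i.e. exactly 1/64 of the FP32 figure.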

I usually use the TechPowerUp database. Note that this database is maintained through volunteer efforts (think Wikipedia) so there are no guarantees of correctness. However, I have found the data to be generally reliable.

Word of advice: never use ChatGPT to research anything. Use it when you want to read a nice story and don’t care about its veracity.

Re FP64 cores: if you look at the diagram on page 22 here, you can see an illustration of a Tesla A100 SM, showing 32 FP64 cores per SM. The consumer line of cards in the same family (Ampere) only has 2 FP64 cores per SM; see the caption below Fig. 2 on page 10 here.

So it’s not emulation? The operations actually run on dedicated hardware, meaning the FP32 cores do nothing when running double precision code?

If that’s the case, when Nsight Systems says 85% of compute warps are in flight, is it making that calculation based on the total number of FP64 cores only?

No, as the caption says, "168 FP64 units (two per SM), which are not depicted in this diagram."

Greg’s comment in this thread may offer some more perspective from the profiler side.

Then I have one final question. When a GPU lists a CUDA core count, such as 5888 on the 3070 Ti Mobile, does that include all cores, or just the FP32 cores? Are INT32 cores included in that as well?

NVIDIA uses the marketing term “CUDA cores”. This is just the count of standalone FP32 cores. Whether INT32 and FP32 are handled in different pipes or shared pipes has varied between GPU architectures.

If you want to roughly plan computational throughput, I would highly recommend referring to the throughput tables in the CUDA Programming Guide that you have already been pointed at.

One particular quirk of recent GPU architectures is that the throughput of MUFU operations has been reduced relative to FP32 operations compared with older architectures. This means that some tradeoffs between MUFU and FP32 that were valid for a decade may no longer apply.
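As a concrete (and purely illustrative) example of such a tradeoff: `__sinf()` is serviced by the MUFU/special-function unit, while the more accurate `sinf()` is a library routine built mostly from ordinary FP32 instructions. Which one is faster in a given kernel now depends on the architecture's MUFU:FP32 throughput ratio, so it is worth profiling both rather than assuming the intrinsic is always the cheaper choice.

```cpp
// Illustrative only: the same computation via the MUFU intrinsic and via the regular
// math library routine. Profile both on the target architecture; the relative cost of
// the MUFU path has shifted between GPU generations.
__global__ void sin_mufu(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __sinf(in[i]);   // fast, lower-accuracy special-function unit path
}

__global__ void sin_fp32(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sinf(in[i]);     // accurate library routine, mostly FP32 FMAs
}
```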

Of standalone FP32 cores and of hybrid FP32+INT32 cores. For example, Ampere has 128 “CUDA cores” per SM: 64 FP32-only and 64 hybrid ones.

Whereas Turing has 64 “CUDA cores” per SM (64 FP32); its 64 INT32 units do not count toward that mostly-marketing term.

All in all, Nvidia is not too bad with marketing language. The CUDA core count is how many FP32 operations (including FMA, fused multiply-add, which counts as 2 FLOP) can be done per cycle.

By “standalone” I meant without taking into account the FP32 capabilities of tensor cores. I should have expressed myself more clearly.

The marketing angle of “CUDA cores” is in the use of the term “cores”. These are raw execution units that bear no resemblance to the cores people may be familiar with from CPUs. In my experience, marketing people will simply glom onto the biggest number in sight, whether it makes technical sense or not. 16,000 CUDA cores compared to a puny 128-core CPU makes GPUs sound more powerful than they are in actual application-level performance.

Could it be that the emulation story was true for Fermi (cc 2.x)?

This article mentions a change between Fermi and Kepler:

The other change coming from GF114 is the mysterious block #15, the CUDA FP64 block. In order to conserve die space while still offering FP64 capabilities on GF114, NVIDIA only made one of the three CUDA core blocks FP64 capable. In turn that block of CUDA cores could execute FP64 instructions at a rate of ¼ FP32 performance, which gave the SM a total FP64 throughput rate of 1/12th FP32. In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block.

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

But that was long ago and predates the RTX 3070.

As CPU cores can do between 8 and 16 (depending on the type) FP32 operations per cycle per core (including fused multiply-add, so 16 to 32 FLOP), 128 CPU cores would be comparable to 1024 to 2048 CUDA cores, with a warp more or less corresponding to an AVX vector.

In addition, CPUs often have 2x to 3x the clock frequency.
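Rough numbers, using those per-core figures and assumed clocks of 3 GHz (CPU) and 1.5 GHz (GPU): 128 cores × 32 FLOP/cycle × 3 GHz ≈ 12 TFLOP/s for the CPU, versus 5888 CUDA cores × 2 FLOP/cycle × 1.5 GHz ≈ 18 TFLOP/s for the 3070 Ti Mobile discussed above.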

So the puny 128-core CPU has some power after all ;-)

Prior to the introduction of FP64 hardware, the CUDA compiler simply demoted double to float. This was a consequence of the fact that the amount of hardware added specifically for CUDA, on top of what was needed for graphics, had to be minimal; I seem to recall it represented 3% of the silicon real estate in G80. At the time, CUDA was an unproven feature that was not driving revenue but instead costing money in terms of engineering resources and hardware production costs. Both the manpower and the time scale available for initial CUDA development precluded offering FP64 emulation.

In processors supporting both FP32 and FP64, these capabilities can be supplied either by shared hardware or by entirely separate hardware units. The former is often a good idea if one plans to offer FP64 at a fixed rate relative to FP32, say 1:2 or 1:4, across the board, whereas the latter approach is superior if one plans market segmentation using the minimum FP64 throughput that some market segments will tolerate. My memory is hazy, but I seem to recall that NVIDIA went with the shared-datapath approach for the initial FP64 implementation, and then quickly swiveled to the use of separate functional units, allowing them to hand-pick the FP32:FP64 ratio in later architectures.

Obviously, there are basic FP64 operations that have always been emulated in software since double was first properly supported in CUDA, namely division and square root. That is just a consequence of the fairly extreme RISC approach (with emphasis on “reduced”) that NVIDIA GPUs have adopted.
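An easy way to see this for oneself (illustrative; the exact instruction sequence differs by architecture): compile a kernel containing a double multiply and a double divide and dump the machine code. The multiply shows up as a single DMUL/DFMA instruction, whereas the division expands into a longer instruction sequence (typically a reciprocal approximation refined by FMA steps), since there is no double-precision divide instruction.

```cpp
// Sketch for inspecting the generated machine code (flags below are the standard
// CUDA toolchain; the exact SASS emitted differs by GPU architecture).
__global__ void mul_vs_div(double *out, const double *a, const double *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[2 * i]     = a[i] * b[i];   // single hardware instruction (DMUL/DFMA)
    out[2 * i + 1] = a[i] / b[i];   // software sequence: no FP64 divide instruction
}
// Build and disassemble, e.g.:
//   nvcc -arch=sm_86 -cubin -o mul_vs_div.cubin mul_vs_div.cu
//   cuobjdump --dump-sass mul_vs_div.cubin
```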

Side remark: this summer CUDA will turn 20 years old. It started with a tiny engineering team in the summer of 2005, leading to an alpha release under NDA in the fall of 2006, followed by public availability in February of 2007. Next to my involvement with the AMD Athlon processor, it is the most exciting project I have ever worked on.

Thanks for your contributions, njuffa, to this very important project that was initiated some 20 years ago, and thanks for your continued interest and involvement.
