High Compute in Flight, Low DRAM Bandwidth Usage

I have created a numerical solver for large systems of equations which works fairly well, with greater than 95% compute in flight but low memory bandwidth usage of around 2%. I don’t understand how this can be the case; I assumed I would be bandwidth-limited if my cores are working so hard. Is this because I am using double precision, which is emulated with 32 CUDA cores? Am I misinterpreting the results?

You need to give us some more detail - Nsight Compute will highlight bottlenecks.

If you are using any GPU that isn’t a data centre/Tesla-class card, you may well be limited by FP64 performance, as all other cards have very few FP64 cores - see here for throughput performance.

Double-precision arithmetic using the double type is supported directly in hardware on all GPUs that have shipped since 2009 or so. It does not involve emulation. However, on many consumer cards there are relatively few FP64 execution units, resulting in comparatively low throughput.
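If you want to see the effect directly, here is a minimal sketch (my own illustrative code, not your solver; grid size, iteration count, and constants are arbitrary assumptions) that times a dependent FMA chain in both precisions. On a consumer card the FP64 kernel typically runs far slower per operation, even though both precisions execute on dedicated hardware.

```cpp
// Illustrative microbenchmark, not the original poster's solver: time a dependent
// FMA chain in FP32 and FP64. On a 1:64 consumer GPU the FP64 version is expected
// to be far slower, even though both precisions run on dedicated hardware units.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_chain(T *out, T a, T b, int iters)
{
    T x = a;
    for (int i = 0; i < iters; ++i) {
        x = x * b + a;          // compiles to an FMA in either precision
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

template <typename T>
static float time_kernel(T *buf, int blocks, int threads, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_chain<T><<<blocks, threads>>>(buf, (T)1.0, (T)1.000001, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 512, threads = 256, iters = 1 << 16;
    float  *buf32;
    double *buf64;
    cudaMalloc(&buf32, blocks * threads * sizeof(float));
    cudaMalloc(&buf64, blocks * threads * sizeof(double));
    float ms32 = time_kernel(buf32, blocks, threads, iters);
    float ms64 = time_kernel(buf64, blocks, threads, iters);
    printf("FP32: %.2f ms   FP64: %.2f ms   slowdown: %.1fx\n", ms32, ms64, ms64 / ms32);
    cudaFree(buf32);
    cudaFree(buf64);
    return 0;
}
```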

My research showed that consumer GPUs only have FP32 cores, is that wrong? And that they emulate doubles with 32 CUDA cores. I also found that the H100 has a ratio of 1:2, and the RTX A6000 a ratio of 1:64. Is this incorrect information?

Thanks for the link to the chart; that is extremely helpful. The results I mentioned came from Nsight Systems. I’ve used Nsight Compute, and have it open in front of me right now, but I don’t see it specifically telling me what the bottleneck is at a hardware level. It mentions things like “small grid” and “workload imbalance”, but these aren’t hardware attributes, which is what I am investigating at the moment.

This is incorrect information. Where is this emulation story coming from, ChatGPT?

This is correct information.

This is the ratio of FP64 execution units to FP32 execution units. Enterprise-class HPC GPUs have comparatively many FP64 units per FP32 unit (e.g. 1:2), consumer-class GPUs comparatively few (e.g. 1:64). The ratio differs by GPU architecture.
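To translate the ratio into throughput (a rough sketch; SM counts and clocks are per-GPU): peak FLOP/s ≈ number of SMs × execution units per SM × 2 (an FMA counts as two FLOP) × clock frequency, evaluated separately for the FP32 and the FP64 unit counts. The 1:2 or 1:64 figure is simply the quotient of those two per-SM unit counts.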

For what it is worth, a while ago I demonstrated in these forums that on modern consumer GPUs with a 1:64 ratio between FP64 and FP32 one can emulate some FP64 operations with higher throughput than that provided by the native hardware, albeit at increased register pressure.
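For the curious, the trick generally revolves around "pair arithmetic": a value is carried as an unevaluated sum of two floats, and FP32 FMAs recover the rounding errors exactly. The fragment below is only a sketch of that general idea (roughly twice float precision, not a full IEEE-754 double), not the code from my earlier post.

```cpp
// Sketch of float-float ("pair") arithmetic using FP32 FMAs -- an illustration of the
// general idea only. A value is stored as an unevaluated sum hi + lo, giving roughly
// twice the precision of a plain float (not a full double).
#include <cuda_runtime.h>

// Error-free product: p + e equals a*b exactly (relies on the FP32 FMA).
__device__ float2 two_prod(float a, float b)
{
    float p = a * b;
    float e = fmaf(a, b, -p);   // exact rounding error of the product
    return make_float2(p, e);
}

// Error-free sum (Knuth's TwoSum): s + e equals a+b exactly.
__device__ float2 two_sum(float a, float b)
{
    float s = a + b;
    float t = s - a;
    float e = (a - (s - t)) + (b - t);
    return make_float2(s, e);
}

// Product of two float-float pairs (x.x is the high part, x.y the low part).
__device__ float2 ff_mul(float2 x, float2 y)
{
    float2 p = two_prod(x.x, y.x);
    p.y = fmaf(x.x, y.y, fmaf(x.y, y.x, p.y));   // fold in the cross terms
    return two_sum(p.x, p.y);                    // renormalize the pair
}

// Minimal kernel using the pair product, just so the fragment compiles standalone.
__global__ void ff_mul_kernel(const float2 *a, const float2 *b, float2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = ff_mul(a[i], b[i]);
}
```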

Initially, yes. But I followed up and found mentions that CUDA uses an entire warp to emulate a double on consumer GPUs. Can you tell me where I can find accurate technical information on these ratios on a per-GPU basis?

I see from the Arithmetic Instructions table listed above that the performance ratio is dependent on the operation and compute capability, and that for a simple multiplication on my 3070 Ti Mobile (compute capability 8.6), I get two results per clock cycle per SM. Does that mean my card uses a ratio of 1:64? And more importantly, is this a design choice in the physical hardware?

That is correct: the RTX 3070 (Ti) Mobile is based on the Ampere architecture and has a 1:64 FP64-to-FP32 ratio.
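Putting rough numbers on that (the boost clock varies by laptop, so take ~1.5 GHz as an assumption): 5888 CUDA cores / 128 per SM = 46 SMs. FP32 peak ≈ 5888 × 2 FLOP (FMA) × 1.5 GHz ≈ 17.7 TFLOP/s, while FP64 peak ≈ 46 SMs × 2 FP64 units × 2 FLOP × 1.5 GHz ≈ 0.28 TFLOP/s, i.e. exactly 1/64 of the FP32 figure.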

I usually use the TechPowerUp database. Note that this database is maintained through volunteer efforts (think Wikipedia) so there are no guarantees of correctness. However, I have found the data to be generally reliable.

Word of advice: never use ChatGPT to research anything. Use it when you want to read a nice story and don’t care about its veracity.

Re FP64 cores: if you look at the diagram on page 22 here, you can see an illustration of a Tesla A100 SM, showing 32 FP64 cores per SM. The consumer line of cards in the same family (Ampere) only has 2 FP64 cores per SM; see the caption below Fig. 2 on page 10 here.

So it’s not emulation? The operations actually run on dedicated hardware, meaning the FP32 cores do nothing when running double precision code?

If that’s the case, when Nsight Systems says 85% of compute warps are in flight, is it making that calculation based on the total number of FP64 cores only?

No, as the caption says, "168 FP64 units (two per SM), which are not depicted in this diagram."

Greg’s comment in this thread may offer some more perspective from the profiler side.

Then I have one final question. When a GPU lists a CUDA core count, such as 5888 on the 3070 Ti Mobile, does that include all cores, or just the FP32 cores? Are INT32 cores included in that as well?

NVIDIA uses the marketing term “CUDA cores”. This is just the count of standalone FP32 cores. Whether INT32 and FP32 are handled in different pipes or shared pipes has varied between GPU architectures.

If you want to roughly plan computational throughput, I would highly recommend referring to the throughput tables in the CUDA Programming Guide that you have already been pointed at.

One particular quirk of recent GPU architectures is that the throughput of MUFU operations has been reduced relative to FP32 operations compared with older architectures. This means that some tradeoffs between MUFU and FP32 that were valid for a decade may no longer apply.
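As a concrete (and purely illustrative) example of such a tradeoff: `__sinf()` is serviced by the MUFU/special-function unit, while the more accurate `sinf()` is a library routine built mostly from ordinary FP32 instructions. Which one is faster in a given kernel now depends on the architecture's MUFU:FP32 throughput ratio, so it is worth profiling both rather than assuming the intrinsic is always the cheaper choice.

```cpp
// Illustrative only: the same computation via the MUFU intrinsic and via the regular
// math library routine. Profile both on the target architecture; the relative cost of
// the MUFU path has shifted between GPU generations.
__global__ void sin_mufu(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __sinf(in[i]);   // fast, lower-accuracy special-function unit path
}

__global__ void sin_fp32(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sinf(in[i]);     // accurate library routine, mostly FP32 FMAs
}
```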

Of standalone FP32 cores and of hybrid FP32+INT32 cores. For example, Ampere has 128 “CUDA cores” per SM: 64 FP32-only and 64 hybrid ones.

Whereas Turing has 64 “CUDA cores” per SM (64 FP32); its 64 INT32 units do not count toward that mostly-marketing term.

All in all, Nvidia is not too bad with marketing language. The CUDA core count is how many FP32 operations (including FMA, fused multiply-add, which counts as 2 FLOP) can be done per cycle.

By “standalone” I meant without taking into account the FP32 capabilities of tensor cores. I should have expressed myself more clearly.

The marketing angle of “CUDA cores” is in the use of the term “cores”. These are raw execution units that bear no resemblance to the cores people may be familiar with from CPUs. In my experience, marketing people will simply glom onto the biggest number in sight, whether it makes technical sense or not. 16,000 CUDA cores compared to a puny 128-core CPU makes GPUs sound more powerful than they are in actual application-level performance.

Could it be that the emulation story was true for Fermi (cc 2.x)?

This article mentions a change between Fermi and Kepler:

The other change coming from GF114 is the mysterious block #15, the CUDA FP64 block. In order to conserve die space while still offering FP64 capabilities on GF114, NVIDIA only made one of the three CUDA core blocks FP64 capable. In turn that block of CUDA cores could execute FP64 instructions at a rate of ¼ FP32 performance, which gave the SM a total FP64 throughput rate of 1/12th FP32. In GK104 none of the regular CUDA core blocks are FP64 capable; in its place we have what we’re calling the CUDA FP64 block.

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

But that was long ago and predates the RTX 3070.

As CPU cores can do between 8 and 16 (depending on the type) FP32 operations per cycle per core (including fused multiply-add, so 16 to 32 FLOP), 128 CPU cores would be comparable to 1024 to 2048 CUDA cores, with a warp more or less corresponding to an AVX vector.

In addition, CPUs often have 2x to 3x the clock frequency.
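Rough numbers, using those per-core figures and assumed clocks of 3 GHz (CPU) and 1.5 GHz (GPU): 128 cores × 32 FLOP/cycle × 3 GHz ≈ 12 TFLOP/s for the CPU, versus 5888 CUDA cores × 2 FLOP/cycle × 1.5 GHz ≈ 18 TFLOP/s for the 3070 Ti Mobile discussed above.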

So the puny 128-core CPU has some power after all ;-)

Prior to the introduction of FP64 hardware, the CUDA compiler simply demoted double to float. This was a consequence of the fact that the amount of hardware added specifically for CUDA, on top of what was needed for graphics, had to be minimal; I seem to recall it represented 3% of the silicon real estate in G80. At the time, CUDA was an unproven feature that was not driving revenue but instead costing money in terms of engineering resources and hardware production costs. Both the manpower and the time scale available for initial CUDA development precluded offering FP64 emulation.

In processors supporting both FP32 and FP64, these capabilities can be supplied either by shared hardware or by entirely separate hardware units. The former is often a good idea if one plans to offer FP64 at a fixed rate relative to FP32, say 1:2 or 1:4, across the board, whereas the latter approach is superior if one plans market segmentation using the minimum FP64 throughput that some market segments will tolerate. My memory is hazy, but I seem to recall that NVIDIA went with the shared-datapath approach for the initial FP64 implementation, and then quickly swiveled to the use of separate functional units, allowing them to hand-pick the FP32:FP64 ratio in later architectures.

Obviously, there are basic FP64 operations that have always been emulated in software since double was first properly supported in CUDA, namely division and square root. That is just a consequence of the fairly extreme RISC approach (with emphasis on “reduced”) that NVIDIA GPUs have adopted.
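An easy way to see this for oneself (illustrative; the exact instruction sequence differs by architecture): compile a kernel containing a double multiply and a double divide and dump the machine code. The multiply shows up as a single DMUL/DFMA instruction, whereas the division expands into a longer instruction sequence (typically a reciprocal approximation refined by FMA steps), since there is no double-precision divide instruction.

```cpp
// Sketch for inspecting the generated machine code (flags below are the standard
// CUDA toolchain; the exact SASS emitted differs by GPU architecture).
__global__ void mul_vs_div(double *out, const double *a, const double *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[2 * i]     = a[i] * b[i];   // single hardware instruction (DMUL/DFMA)
    out[2 * i + 1] = a[i] / b[i];   // software sequence: no FP64 divide instruction
}
// Build and disassemble, e.g.:
//   nvcc -arch=sm_86 -cubin -o mul_vs_div.cubin mul_vs_div.cu
//   cuobjdump --dump-sass mul_vs_div.cubin
```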

Side remark: this summer CUDA will turn 20 years old. It started with a tiny engineering team in the summer of 2005, leading to an alpha release under NDA in the fall of 2006, followed by public availability in February of 2007. Next to my involvement with the AMD Athlon processor, it is the most exciting project I have ever worked on.

Thanks for your contributions, njuffa, to this very important project that was initiated some 20 years ago, and thanks for your continued interest and involvement.
