About mixed precision performance and cc 3.x

Can the 4 warp schedulers feed both fp32 and fp64 cores concurrently? For example, suppose the instructions c=a+b (32-bit variables) and f=d+e (64-bit variables) both need to be issued. Can the 4 warp schedulers issue both calculations without much performance cost?

If the card has one fp64 core for every three fp32 cores, can mixed precision code still gain some performance?

If the answer is yes, which of the options below performs best?

  • instruction-level parallelism (3x 32-bit calculations followed by 1x 64-bit calculation, or the fp64 first and the 3x fp32 last)
  • warp-level parallelism (the first 3 threads in each group do 32-bit, the last thread does 64-bit)
  • thread-group-level parallelism (e.g. one whole thread group does the 64-bit calc while 4 other groups do 32-bit)
  • kernel-level parallelism (3 kernels doing 32-bit, 1 kernel doing 64-bit, repeated, all in different CUDA streams)
  • dynamic parallelism (all threads do 32-bit, but some threads spawn child kernels that also do 64-bit)
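For concreteness, the warp-level option might look like the sketch below (hypothetical kernel; the array names and the 3:1 warp split are illustrative, and whether the schedulers actually issue to both pipes concurrently is exactly the question being asked):

```cuda
// Sketch of warp-level parallelism: 3 of every 4 warps do fp32 work,
// the 4th does fp64 work. Purely illustrative; any gain depends on
// the schedulers being able to feed both pipes in the same cycles.
__global__ void mixed(const float *a32, float *c32,
                      const double *a64, double *c64, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;
    if (tid >= n) return;

    if (warp % 4 != 3)           // warps 0..2 of each group of 4: fp32
        c32[tid] = a32[tid] + 1.0f;
    else                         // warp 3 of each group of 4: fp64
        c64[tid] = a64[tid] + 1.0;
}
```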

If there is no exact answer for all scenarios, can this configuration (3x 32-bit cores + 1x 64-bit core) get a performance boost of 15% or more compared to a pure fp32 version?

Thank you for your time

Besides the well-publicised number of FPUs (“cores”), there is an equally important bottleneck: register file bandwidth. This is already noticeable in that the third set of single-precision FPUs is often of little real-world use because it is bandwidth-starved.

It is no coincidence that the number of double-precision FPUs exactly matches that bandwidth cliff. Trying to serve them in parallel would require twice the bandwidth, and I am sure Nvidia would have found other good uses for that rather than hiding its existence for years.

On a side note, I (and others of course) have been wondering for years why Nvidia decided to add that third set of single precision FPUs at all. I am hearing that graphics workloads (which I am not familiar with) somehow make better use of that third set. But looking at recent architectures, Nvidia also seems to have come to the conclusion that it wasn’t really worth it.

Thanks for the detailed info. Isn’t register access a zero-latency operation? Can’t they do 3-way, or at least 2-way, pipelining before an fp32 or fp64 operation completes? What’s wrong with the 3rd set? Do you mean a 2:1 ratio would be better than 3:1 for 32/64? Is register bandwidth independent of the texture cache, constant cache, shared memory and L1 bandwidths? Could those memory types stream data to the remaining starved cores with the right CUDA coding? I mean, if register bandwidth is 1 TB/s and the cores still need 250 GB/s more, can the other memories help concurrently? Or does everything pass through the same pipe and get serialized (which I hope is not the case) before reaching the ALU/FPU? That is, does each FPU have multiple data entrances, or a single fetch path?

Or do you mean that two FP32 cores meld into an FP64 to compute 64-bit virtually, rather than there being a real 64-bit core?

As a load/store architecture, all FPU operands come from the register file (with the constant cache being one exception in the Nvidia architectures). The register file serves to absorb the large and highly variable latency of memory accesses (as well as acting as an explicitly managed cache). Feeding FPUs directly from memory would expose them to that latency and lead to poor overall FPU utilization.

The one exception is the constant cache, which could just be seen as immediate operands stored outside the program code. Fused multiply-add (a three-input operation) can take (at least?) one operand directly from constant cache at full throughput, which is one way to achieve full utilisation of all three sets of single precision FPUs despite the register file bandwidth limitation.
Apparently multiplication by and/or addition with constants is used more often in graphics workloads than in the contexts I am familiar with, so Nvidia decided to keep the third set of FPUs for a few generations.
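As a sketch of that constant-cache operand path (hypothetical names; whether the compiler actually emits the constant-bank operand form is up to ptxas):

```cuda
__constant__ float scale;   // lives in constant memory / constant cache

__global__ void fma_const(const float *x, const float *b, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // One of the three FMA inputs can come straight from the
        // constant bank, sparing one register-file read per operation.
        y[i] = fmaf(x[i], scale, b[i]);
}
```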

The other way to achieve full single-precision FPU throughput on Kepler that I am aware of is to use the same operand more than once in the same operation (although that seems to be of limited usefulness). Maxwell and Pascal expand on this capability with their reuse flags. Scott Gray, I am sure, can tell you much more about this than I can.
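The operand-reuse case is the same idea applied to the register file itself (illustrative kernel; the actual reuse flags only exist at SASS level and are placed by the assembler, not by CUDA source):

```cuda
__global__ void square_plus_one(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // x[i] is used twice in the same FMA, so this 3-input
        // operation needs only two distinct register reads.
        y[i] = fmaf(x[i], x[i], 1.0f);
}
```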

Does the same go for the SFU “patches” too? Are they virtual again? Or would I get a boost by running my own sin() implementation alongside the native sin() function?

I mean that two FP32 operations have the same bandwidth requirement as one FP64 operation (even though the FP64 operation requires more than twice the silicon of an FP32 operation if it were to run at the same speed).

Your own sin() implementation would require several instructions, and quite likely somewhere in there is an otherwise unused opportunity to dual-issue the SFU sin() instruction. Yet that would give you nowhere near twice the SFU throughput, because your own implementation achieves only a fraction of the SFU’s throughput.
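A sketch of what that mix might look like (hypothetical: __sinf() maps to the fast SFU/MUFU path, while sinf() expands to a multi-instruction software sequence on the regular FPUs; any dual-issue is at the compiler’s and scheduler’s discretion):

```cuda
__global__ void sin_mix(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float fast = __sinf(x[i]);  // fast SFU (MUFU) approximation
        float slow = sinf(x[i]);    // multi-instruction FPU sequence
        // The FPU sequence of sinf() leaves issue slots in which the
        // scheduler may be able to co-issue the SFU instruction.
        y[i] = fast + slow;
    }
}
```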

So if I flag an operand for “reuse” in the same native operation, such as a multiplication or division, then I can use a variable second operand (coming from the constant cache) and a third target operand to push the limits?

So the special function unit could get some 10% or 30%-ish more performance by also using FP32 implementations (but not 100%, as you say)?

What do you mean by “SFU patches”, and in what sense might they be virtual?

Special function units. Near the FP64 patches in images of the K80 microarchitecture.

So texture cache, L1, shared memory and local memory feed the register file. The register file and constant cache feed the cores. But if an operand is re-used, then the register file can feed all the cores better.

I’m not trying to fully synchronize all cores. Just wondering whether, in any given cycle, everything can be overlapped at least for a single stage of their pipelines.

I should have expressed myself more clearly. I am well aware of what an SFU is, I was wondering about the use of “patches”. Now I understand that was a reference to a particular region in a symbolic representation of GPU execution resources.

I am reasonably sure that, subject to resource restrictions such as register file ports, SFU operations can be dual-issued together with a “regular” FPU instruction in Maxwell and later architectures, but not Kepler-class devices.

Normally the compiler will take care of exploiting such opportunities where they exist. Full control by the programmer would require directly programming at SASS level. You might want to peruse Scott Gray’s publicly posted information on what exactly is possible on a Maxwell-class GPU if you plan to go down that path.

I want to stay on the pure CUDA path. If the compiler takes care of all those optimizations, then I’m fine.

Maybe I should have used “tile”, “pipeline”, “core” or “module” instead of “patch”. I don’t really know what each element in the picture is commonly called.

Thank you njuffa and tera.