A lone unit would make sense if it were inside the single fp64 unit, but it’s been stated several times that the fp16x2 units are actually part of the fp32 SPs.
Microcode emulation would be easier (just software!), cheaper (no extra transistors), and more performant than a lone 1/64- or 1/32-rate hardware unit. But it doesn’t explain the 1/64 rate quotes.
FMA on a narrow floating-point type isn’t trivially emulated by FMA on a wider floating-point type, at least not if you need to get denormal results correct, and as far as I know NVIDIA provides FP16 with denormal support. So emulation of FP16 via FP32 seems unlikely, and any microcode implementation seems very unlikely since GPUs typically do not have the machinery for that.
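To make that concern concrete, here is a minimal sketch (my own illustration, not anything NVIDIA ships) of the naive FP32-based emulation being discussed; the comments mark exactly where it can diverge from a native FP16 FMA.

```
#include <cuda_fp16.h>

// Hypothetical, naive emulation of a half2 FMA on the FP32 units:
// widen to float, use fmaf, narrow back. The final float->half
// conversion adds a second rounding step, and FP16 denormal results
// are not guaranteed to match a native __hfma2 with denormal support,
// which is why this emulation is not trivially correct.
__device__ __half2 hfma2_via_fp32(__half2 a, __half2 b, __half2 c)
{
    float2 fa = __half22float2(a);
    float2 fb = __half22float2(b);
    float2 fc = __half22float2(c);
    float2 r;
    r.x = fmaf(fa.x, fb.x, fc.x);   // rounded once to FP32 ...
    r.y = fmaf(fa.y, fb.y, fc.y);
    return __float22half2_rn(r);    // ... and rounded again to FP16
}
```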
Using a scalar FP16 unit to emulate FP16x2 SIMD via simple state machine would certainly be possible; the emulation of wider SIMD via narrower SIMD has been used extensively in x86 processors, particularly for first generation implementations.
The most straightforward hypothesis is that the low throughput is simply due to the use of a tiny “native FP16x2” unit that is provided for architectural compatibility, in the same way this is done for double precision units. The motivation for this approach would be the same as in the case of double precision: Differentiate parts by target market, in offering small die, low power, low cost GPUs for the mass market, and big die, higher power, more fully featured, high cost GPUs for specialized markets. From what I understand, high FP16x2 throughput is needed primarily for the training phase, for which relevant major customers presumably buy high-end GPUs, and it would be beneficial to NVIDIA to keep it that way.
Sorry, I goofed on the throughput measurement: I was only using a single warp on a single SM to measure it. Though it’s not clear to me how that would make it appear to execute a list of these instructions twice as fast as it should.
Anyway, maxing out the GPU with those instructions shows a throughput of 1/128. So, as njuffa mentions, there is probably a single FP16 unit per SM. The latency of the instruction is entirely consistent with this setup (and not at all with it being implemented on the CUDA cores).
The double precision throughput is 1/32, just like Maxwell.
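For reference, a minimal sketch of the kind of measurement described above (my own code, with made-up names): a long dependent chain of HFMA2 instructions, launched with enough blocks to occupy every SM so the number reflects throughput rather than single-warp latency.

```
#include <cuda_fp16.h>

// Dependent HFMA2 chain; x converges to 1.0, so no overflow on long runs.
__global__ void hfma2_throughput(__half2 *out, int iters)
{
    __half2 a = __float2half2_rn(0.5f);
    __half2 b = __float2half2_rn(0.5f);
    __half2 x = __float2half2_rn(0.0f);
    for (int i = 0; i < iters; ++i)
        x = __hfma2(a, x, b);                       // keeps the FP16 pipe busy
    out[blockIdx.x * blockDim.x + threadIdx.x] = x; // defeat dead-code elimination
}

// Time with cudaEvents over many blocks of, say, 256 threads:
// FP16 FLOP rate = blocks * 256 * iters * 2 (FMA) * 2 (vec2) / seconds.
```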
@allanmac: I don’t have a GTX 1080 to try, but from the information available so far it would seem your manual emulation should be faster, modulo any register pressure effects due to the emulation.
The P100 whitepaper states: “One new capability that has been added to GP100’s FP32 CUDA Cores is the ability to process both 16-bit and 32-bit precision instructions and data”.
A.A, I am doing a project on sparse matrix-vector multiplication using the CSR format on the GPU, and I am getting an error in my code.
Does anyone have code for this for CUDA 7.5?
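Since the failing code isn’t posted, here is a minimal one-thread-per-row CSR kernel (my own sketch, names are made up) that builds on CUDA 7.5 and can serve as a reference to compare against.

```
// y = A*x for a CSR matrix: rowPtr has numRows+1 entries,
// colIdx/val hold the column index and value of each nonzero.
__global__ void spmv_csr_scalar(int numRows,
                                const int *rowPtr, const int *colIdx,
                                const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}
```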
Perhaps you should define what you mean by concrete.
Excerpting from the PTX 5.0 ISA document that ships with CUDA 8 RC:
‣ Extends atomic and reduction instructions to perform fp64 add operation.
‣ A new dp4a instruction which allows 4-way dot product with accumulate operation.
‣ A new dp2a instruction which allows 2-way dot product with accumulate operation.
dp4a, for example, requires sm_61 or higher, so it is not even available on sm_60 (Tesla P100).
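For anyone curious what that looks like in source, a quick sketch using the __dp4a intrinsic that CUDA 8 exposes for sm_61+ (kernel name and buffers are mine): each int is treated as four packed bytes, and the result is the 4-way byte dot product added to the accumulator.

```
// Compile with: nvcc -arch=sm_61 ...
// __dp4a is declared automatically when targeting sm_61 or higher.
__global__ void dp4a_example(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]); // a.x*b.x + ... + a.w*b.w + acc
}
```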
Does SM 6.1 have the same updated Unified Memory system features of SM 6.0, like system-wide addressing and coherent page faulting? The P100 whitepaper and Mark Harris’s Pascal introduction carefully call these P100 features, not Pascal’s.
"So I got an email from NVIDIA this morning. We can finally lay the question of FP16 execution to rest once and for all.
GP104 has a single, dedicated FP16x2 core per SM. The FP32 cores cannot execute FP16x2.
This is basically identical to how NVIDIA does FP64, except GP104 has more FP64 units (4 per SM). This is where the 1/128 instruction rate comes from, and since it’s capable of executing 2 FP16 ops in a vec2, the resulting 1/64 FLOP rate. This also means that it takes 32 clocks to actually execute a single instruction of a single warp."
Are you able to switch the 1070 and 1080 to P-state P0 in a CUDA app? I was using a small trick on the 9xx with application clocks (setting them to the highest memory clock switched the card to P0).
It’s a problem because the P2 memory clock is reduced (and not user-editable).
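For what it’s worth, the application-clocks trick can be driven from NVML on the host side (sketch below, my own code, link with -lnvidia-ml). Whether this still lifts a 1070/1080 out of P2 is exactly the open question, and on GeForce boards the set call may be rejected or require elevated privileges, so treat this as an experiment, not a known-good recipe.

```
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int memClocks[32], gfxClocks[128];
    unsigned int nMem = 32, nGfx = 128, maxMem = 0, maxGfx = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Find the highest supported memory clock and its highest graphics clock.
    nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClocks);
    for (unsigned int i = 0; i < nMem; ++i)
        if (memClocks[i] > maxMem) maxMem = memClocks[i];
    nvmlDeviceGetSupportedGraphicsClocks(dev, maxMem, &nGfx, gfxClocks);
    for (unsigned int i = 0; i < nGfx; ++i)
        if (gfxClocks[i] > maxGfx) maxGfx = gfxClocks[i];

    // Setting application clocks to the top memory clock was the 9xx trick.
    nvmlDeviceSetApplicationsClocks(dev, maxMem, maxGfx);
    printf("requested %u MHz mem / %u MHz graphics\n", maxMem, maxGfx);

    nvmlShutdown();
    return 0;
}
```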
I am securing a 1070 (MSI GeForce GTX 1070 DirectX 12 GTX 1070 GAMING X 8G 8GB) in the coming week and will run code/tests/commands people are interested in. Start posting the things you would like to see.
Windows 7 / Ubuntu 16.04 available.
I will create a custom thread by midweek, 7/14/16.