A lone unit would make sense if it were inside the single fp64 unit, but it’s been stated several times that the fp16x2 units are actually part of the fp32 SPs.
Microcode emulation would be easier (just software!), cheaper (no extra transistors), and more performant than a lone 1/64- or 1/32-rate hardware unit. But it doesn’t explain the 1/64 rate quotes.
FMA on a narrow floating-point type isn’t trivially emulated by FMA on a wider floating-point type, at least not if you need to get denormal results correct, and as far as I know NVIDIA provides FP16 with denormal support. So emulation of FP16 via FP32 seems unlikely, and any microcode implementation seems very unlikely since GPUs typically do not have the machinery for that.
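To make that concern concrete, here is a minimal sketch (my own illustration, not anything NVIDIA ships) of the naive FP32-based emulation being discussed; the comments mark exactly where it can diverge from a native FP16 FMA.

```
#include <cuda_fp16.h>

// Hypothetical, naive emulation of a half2 FMA on the FP32 units:
// widen to float, use fmaf, narrow back. The final float->half
// conversion adds a second rounding step, and FP16 denormal results
// are not guaranteed to match a native __hfma2 with denormal support,
// which is why this emulation is not trivially correct.
__device__ __half2 hfma2_via_fp32(__half2 a, __half2 b, __half2 c)
{
    float2 fa = __half22float2(a);
    float2 fb = __half22float2(b);
    float2 fc = __half22float2(c);
    float2 r;
    r.x = fmaf(fa.x, fb.x, fc.x);   // rounded once to FP32 ...
    r.y = fmaf(fa.y, fb.y, fc.y);
    return __float22half2_rn(r);    // ... and rounded again to FP16
}
```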
Using a scalar FP16 unit to emulate FP16x2 SIMD via simple state machine would certainly be possible; the emulation of wider SIMD via narrower SIMD has been used extensively in x86 processors, particularly for first generation implementations.
The most straightforward hypothesis is that the low throughput is simply due to the use of a tiny “native FP16x2” unit that is provided for architectural compatibility, in the same way this is done for double precision units. The motivation for this approach would be the same as in the case of double precision: Differentiate parts by target market, in offering small die, low power, low cost GPUs for the mass market, and big die, higher power, more fully featured, high cost GPUs for specialized markets. From what I understand, high FP16x2 throughput is needed primarily for the training phase, for which relevant major customers presumably buy high-end GPUs, and it would be beneficial to NVIDIA to keep it that way.
Sorry, I goofed on the throughput measurement: I was only using a single warp on a single SM to measure it. Though it’s not clear to me how that would make it appear to execute a list of these instructions twice as fast as it should.
Anyway, maxing out the GPU with those instructions shows a throughput of 1/128. So, as njuffa mentions, there is probably a single FP16 unit per SM. The latency of the instruction is entirely consistent with this setup (and not at all with it being implemented on the CUDA cores).
The double precision throughput is 1/32, just like Maxwell.
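For reference, a minimal sketch of the kind of measurement described above (my own code, with made-up names): a long dependent chain of HFMA2 instructions, launched with enough blocks to occupy every SM so the number reflects throughput rather than single-warp latency.

```
#include <cuda_fp16.h>

// Dependent HFMA2 chain; x converges to 1.0, so no overflow on long runs.
__global__ void hfma2_throughput(__half2 *out, int iters)
{
    __half2 a = __float2half2_rn(0.5f);
    __half2 b = __float2half2_rn(0.5f);
    __half2 x = __float2half2_rn(0.0f);
    for (int i = 0; i < iters; ++i)
        x = __hfma2(a, x, b);                       // keeps the FP16 pipe busy
    out[blockIdx.x * blockDim.x + threadIdx.x] = x; // defeat dead-code elimination
}

// Time with cudaEvents over many blocks of, say, 256 threads:
// FP16 FLOP rate = blocks * 256 * iters * 2 (FMA) * 2 (vec2) / seconds.
```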
@allanmac: I don’t have a GTX 1080 to try, but from the information available so far it would seem your manual emulation should be faster, modulo any register pressure effects due to the emulation.
The P100 whitepaper states: “One new capability that has been added to GP100’s FP32 CUDA Cores is the ability to process both 16-bit and 32-bit precision instructions and data”.
A.A, I am doing a project on sparse matrix-vector multiplication using the CSR format on the GPU, and I am getting an error in my code.
Does anyone have code for this for CUDA 7.5?
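Since the failing code isn’t posted, here is a minimal one-thread-per-row CSR kernel (my own sketch, names are made up) that builds on CUDA 7.5 and can serve as a reference to compare against.

```
// y = A*x for a CSR matrix: rowPtr has numRows+1 entries,
// colIdx/val hold the column index and value of each nonzero.
__global__ void spmv_csr_scalar(int numRows,
                                const int *rowPtr, const int *colIdx,
                                const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}
```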
Perhaps you should define what you mean by concrete.
Excerpting from the PTX 5.0 ISA document that ships with CUDA 8 RC:
‣ Extends atomic and reduction instructions to perform fp64 add operation.
‣ A new dp4a instruction which allows 4-way dot product with accumulate operation.
‣ A new dp2a instruction which allows 2-way dot product with accumulate operation.
dp4a, for example, requires sm_61 or higher, so it is not even available on sm_60 (Tesla P100).
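For anyone curious what that looks like in source, a quick sketch using the __dp4a intrinsic that CUDA 8 exposes for sm_61+ (kernel name and buffers are mine): each int is treated as four packed bytes, and the result is the 4-way byte dot product added to the accumulator.

```
// Compile with: nvcc -arch=sm_61 ...
// __dp4a is declared automatically when targeting sm_61 or higher.
__global__ void dp4a_example(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]); // a.x*b.x + ... + a.w*b.w + acc
}
```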
Does SM 6.1 have the same updated Unified Memory system features of SM 6.0, like system-wide addressing and coherent page faulting? The P100 whitepaper and Mark Harris’s Pascal introduction carefully call these P100 features, not Pascal’s.
"So I got an email from NVIDIA this morning. We can finally lay the question of FP16 execution to rest once and for all.
GP104 has a single, dedicated FP16x2 core per SM. The FP32 cores cannot execute FP16x2.
This is basically identical to how NVIDIA does FP64, except GP104 has more FP64 units (4 per SM). This is where the 1/128 instruction rate comes from, and since it’s capable of executing 2 FP16 ops in a vec2, the resulting 1/64 FLOP rate. This also means that it takes 32 clocks to actually execute a single instruction of a single warp."
Are you able to switch the 1070 and 1080 to P-state P0 in a CUDA app? I was using a small trick on the 9xx with application clocks (setting them to the highest memory clock switched the card to P0).
It’s a problem because the P2 memory clock is reduced (and not user-editable).
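For what it’s worth, the application-clocks trick can be driven from NVML on the host side (sketch below, my own code, link with -lnvidia-ml). Whether this still lifts a 1070/1080 out of P2 is exactly the open question, and on GeForce boards the set call may be rejected or require elevated privileges, so treat this as an experiment, not a known-good recipe.

```
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int memClocks[32], gfxClocks[128];
    unsigned int nMem = 32, nGfx = 128, maxMem = 0, maxGfx = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Find the highest supported memory clock and its highest graphics clock.
    nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClocks);
    for (unsigned int i = 0; i < nMem; ++i)
        if (memClocks[i] > maxMem) maxMem = memClocks[i];
    nvmlDeviceGetSupportedGraphicsClocks(dev, maxMem, &nGfx, gfxClocks);
    for (unsigned int i = 0; i < nGfx; ++i)
        if (gfxClocks[i] > maxGfx) maxGfx = gfxClocks[i];

    // Setting application clocks to the top memory clock was the 9xx trick.
    nvmlDeviceSetApplicationsClocks(dev, maxMem, maxGfx);
    printf("requested %u MHz mem / %u MHz graphics\n", maxMem, maxGfx);

    nvmlShutdown();
    return 0;
}
```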
I am securing a 1070 (MSI GeForce GTX 1070 DirectX 12 GTX 1070 GAMING X 8G 8GB) in the coming week and will run code/tests/commands people are interested in. Start posting the things you would like to see.
Windows 7 / Ubuntu 16.04 available.
I will create a custom thread by midweek, 7/14/16.