FP16x2 ops in sm_52

A little spelunking reveals some new instructions:

Gist here.

Couldn’t get min/max ops to compile and there appears to be no support yet for the saturate modifier or round modes.

FTZ can be twiddled.

This was just a quick Saturday morning coffee experiment and I have no idea if this even works.

I’ll play around with these in my assembler. But this has the potential of doubling fp16 gemm performance. I’ll need to think about how to restructure the computation though. It may require a bunch of bit-field operations to set up the computation. So perhaps not quite 2x. Also, not sure how much error you’d accumulate in fp16.

Melt those multiprocessors!

I think FP16x2 is a new feature of sm_52?

I would be interested to hear where use of FP16 actually makes sense. Back when I worked with OpenGL-ES 1.0 in the early 2000s I pondered this a bit, and concluded it was not very useful. I looked at 32-bit fixed-point arithmetic (s15.16), seriously considered a 32-bit logarithmic number system, but eventually came back to FP32, and proceeded to optimize emulation code for that to the hilt (there was no hardware floating-point support in ARM CPUs for handheld products at that time, and existing emulation libraries were painfully slow).

FP16 only has 11 mantissa bits (10 stored plus the implicit leading bit). One cannot afford to lose much accuracy before this becomes noticeable in the 8-bit results that are often the output of graphics computations. Relatively simple computations such as a short-vector dot product or 2D interpolation can already incur 3-5 ulps of error, as can the evaluation of a fast math function. So that is 2-3 bits of accuracy lost right there. The availability of hardware FMA puts a slightly different spin on the analysis, by reducing round-off error, but I am still skeptical as to the overall utility of FP16.
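To put rough numbers on the ulp argument, here is a minimal sketch in plain Python. It is not GPU code: it assumes that rounding through the `struct` module's binary16 format (`"e"`) is a fair stand-in for hardware FP16 with round-to-nearest-even, and the helper names (`to_fp16`, `ulp_fp16`, `dot_fp16`, `error_in_ulps`) are made up for illustration.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def ulp_fp16(x):
    """Spacing of binary16 values at magnitude |x| (x finite and nonzero)."""
    _, e = math.frexp(abs(x))        # |x| = m * 2**e with 0.5 <= m < 1
    return 2.0 ** max(e - 11, -24)   # 11 significand bits; subnormal spacing 2**-24

def dot_fp16(a, b):
    """Dot product rounded to binary16 after every multiply and add (no FMA)."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc

def error_in_ulps(approx, exact):
    """Absolute error measured in binary16 ulps at the magnitude of `exact`."""
    return abs(approx - exact) / ulp_fp16(exact)

if __name__ == "__main__":
    # 11 significand bits: 2049 is not representable and rounds to 2048.
    print(to_fp16(2049.0))
    # One step of an 8-bit channel (1/255) expressed in FP16 ulps for values
    # in [0.5, 1): about 8 ulps, so a 4 ulp error is roughly half a step.
    print((1.0 / 255.0) / ulp_fp16(0.75))
```

Feeding `dot_fp16` your own vectors and comparing against a double-precision dot product via `error_in_ulps` gives a quick feel for how much of that 8-ulp budget a given computation uses up.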

As for “melting” the SMs, the FP16x2 operations are presumably implemented by partitioning the existing hardware for FFMA (at least that is the technique I recall from the late 1990s), so the number of transistors switching and the power consumption should stay the same.

FP16x2 and x4 atomics are documented in a GM204 OpenGL extension. That’s all I’ve seen.

FP16xN use case: compositing!

Compositing (e.g. via alpha maps) has been around for a long time and I am not aware of performance issues on modern GPUs. Does use of FP16 improve this qualitatively or quantitatively?

allanmac, were you able to compute anything with these instructions? I have the opcodes and I can assemble them now, but they seem to just be a no-op. I’m using 2 packed values of 1.0: 0x3c003c00.
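For building test operands like 0x3c003c00 and the results to expect from them, a small host-side reference can help. The sketch below is plain Python, not SASS or PTX, and the function names are hypothetical. It assumes the low 16 bits of the word hold the first lane, which is a guess about the layout rather than something confirmed in this thread.

```python
import struct

def pack_half2(lo, hi):
    """Pack two values, rounded to binary16, into one 32-bit word (lo in bits 0-15)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_half2(word):
    """Split a 32-bit word into its (lo, hi) binary16 lanes as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))

def add_half2_reference(a, b):
    """Expected result of a lane-wise packed add, each lane rounded independently."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

if __name__ == "__main__":
    ones = pack_half2(1.0, 1.0)
    print(hex(ones))                             # 0x3c003c00
    print(hex(add_half2_reference(ones, ones)))  # 0x40004000, i.e. (2.0, 2.0)
```

One thing to watch when hunting for a no-op: 1.0 * 1.0 gives back the input pattern, so a multiply with these operands cannot be told apart from an instruction that does nothing. An add, or distinct values per lane such as `pack_half2(1.0, 2.0)`, makes the difference visible.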

Maybe only the X1 has these instructions implemented?

Peak GFLOPS (FP16): 1024

FP16 is the Goldilocks representation for some applications – especially mobile applications.

The Maxwell X1 whitepaper calls out image processing.

Others have noted that FP16 gets you more bang-for-buck when compositing on power/thermal/size-limited devices.

The jump from the very recent old days of blending/lerping 8-bit channels to a full float4 representation is overkill for some mundane use cases.

I look at some of my own CUDA code and I could really benefit from a packed FP16x2 since it would decrease register pressure and (I’m pretty sure) have no appreciable impact on output quality.

But most important of all is the nice marketing lift you get from quoting 2x the FLOPS! :)

Nope! I just spotted them this morning and tested nothing.

Ah well, it would’ve been a nice late Christmas present.

Note that the 7.0RC ptxas has an sm_53 device (“ptxas --help”).

I am starting to lose track of all these new architectures. sm_52 is the basis for GTX 980, correct? What’s in the X1?

FP16 as a storage format for low-bandwidth Tegra GPUs or low-accuracy bandwidth-constrained use cases in general has always made sense to me. What I am wondering about is the real-world applicability of FP16 computation given the potential accuracy issues I perceive.

The primary talking points at the URLs above regarding FP16 computation seem to be in the context of mobile computing and are (1) computer vision (2) more FLOPS/watt (3) provides more precision than 8-bit hardware fixed-point computation while being more efficient than FP32 (4) higher FLOPS in benchmarks and on marketing slides. As an engineer I do not care for the last one :-) It will be interesting to see how the other three items play out. I will keep an eye out for relevant papers.

Well it’s interesting that the compiler puts barrier synchronization flags on these instructions. That implies that they’re not implemented on the cuda cores (much like how DP is implemented). Wonder if GM206 supports them (or sm_53).

Yup, the GTX 9x0 is sm_52.

Besides the FP16 support the X1 whitepaper and presentations show what looks to be a 2 SMM Maxwell with sm_50’ish register (64K) and shared mem (64KB) counts. The single SMX K1 (sm_32) only had 32K regs and 48KB of shared so this is a huge jump when multiplied by 2 SMMs.

You crystallized the FP16 talking points. Maybe you should rejoin NVIDIA in the marketing dept? :)

njuffa: here’s an interesting paper:

The key results are in table 3. And there’s this quote:

“We show that the use of half precision floating point format has little to no impact on the training of neural networks”

Anyway, I’m fairly certain this is the primary application that nvidia is aiming to support.

If I understand correctly, you are implying that FP16 units are physically separate from the FFMA units? That would be puzzling to me, but the last time I was involved with processor design was prior to the year 2000, so what do I know :-). Multiplier arrays in particular are expensive pieces of silicon real estate, so presumably one would want to re-use the arrays built for a wider format by partitioning them to be re-usable for the narrower format, with little increase in overall die area to handle both formats and minimal impact on latency.

Thanks for the pointer to the paper. It seems that FP16 provides just about the precision required for deep learning applications: “We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.”

FP16x2 is only implemented on Tegra X1 (sm_53).

Scott, what do you think of the possibility of writing an SGEMM-like kernel that does FP16 compute, but maintains a separate FP32 accumulator? E.g., accumulate for “k” iterations in FP16, and then accumulate in FP32 into the total accumulator. How do you think the register count would fare?

Periodically accumulating into an FP32 register would work, but the cost of the “dump” is high. You need to mask off the top bits, CVT to FP32, then add to your fp32 accumulator, then do the same for the high fp16 (with a shift instead of a mask), resulting in 6 operations to dump the two packed sums to fp32 registers. So you could only be better off speed-wise if you were dumping after at least 10 fp16 accumulations, if not more.
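The accuracy side of this scheme can be tried on the host before writing any assembly. The following is a rough Python model, not a kernel. It assumes round-to-nearest-even FP16 adds in each lane, models the dump as the mask/convert/add and shift/convert/add sequence described above, and uses invented helper names throughout.

```python
import struct

def bits_to_half(bits):
    """Interpret a 16-bit integer as a binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def half_to_bits(x):
    """Round x to binary16 (ties to even) and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to binary32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_half2(a, b):
    """Simulated packed add: both 16-bit lanes are added and rounded separately."""
    lo = half_to_bits(bits_to_half(a & 0xFFFF) + bits_to_half(b & 0xFFFF))
    hi = half_to_bits(bits_to_half(a >> 16) + bits_to_half(b >> 16))
    return (hi << 16) | lo

def dump_half2(word, acc_lo, acc_hi):
    """Six-step dump: mask, convert, add for the low lane; shift, convert, add for the high."""
    lo = bits_to_half(word & 0xFFFF)   # mask + convert
    acc_lo = to_fp32(acc_lo + lo)      # FP32 add
    hi = bits_to_half(word >> 16)      # shift + convert
    acc_hi = to_fp32(acc_hi + hi)      # FP32 add
    return acc_lo, acc_hi

def blocked_sum_half2(words, k):
    """Accumulate packed FP16 pairs, dumping to two FP32 accumulators every k steps."""
    acc_lo = acc_hi = 0.0
    partial = 0                        # 0x00000000 is the pair (0.0, 0.0)
    for i, w in enumerate(words, start=1):
        partial = add_half2(partial, w)
        if i % k == 0:
            acc_lo, acc_hi = dump_half2(partial, acc_lo, acc_hi)
            partial = 0
    return dump_half2(partial, acc_lo, acc_hi)   # flush any remainder

if __name__ == "__main__":
    ones = [0x3C003C00] * 4096         # 4096 pairs of (1.0, 1.0)
    print(blocked_sum_half2(ones, 16)) # (4096.0, 4096.0)
```

Summing 4096 ones purely in FP16 stalls at 2048, because 2049 is not representable and rounds back down, while dumping every 16 steps recovers 4096 in both lanes. The model says nothing about speed; it only shows how the choice of k trades dump cost against the error that builds up in the FP16 partial sums.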

As an alternative, one might consider a Kahan-style compensated FP16 sum where each FP16 summand accumulated into the compensated sum is actually the partial sum of n FP16 operations.
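A quick way to see what the compensated variant buys is to simulate it on the host. This is a sketch in plain Python under the assumption that every FP16 operation rounds to nearest even; it uses the textbook Kahan recurrence, and the function names and the block size parameter `n` are illustrative only.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def naive_sum_fp16(values):
    """Plain running sum with a rounding to binary16 after every add."""
    s = 0.0
    for v in values:
        s = to_fp16(s + to_fp16(v))
    return s

def kahan_blocked_sum_fp16(values, n):
    """Kahan sum in binary16 where each summand is a plain FP16 sum of n values."""
    s = 0.0   # running compensated sum
    c = 0.0   # running compensation (rounding error carried forward)
    for start in range(0, len(values), n):
        partial = naive_sum_fp16(values[start:start + n])
        y = to_fp16(partial - c)          # apply the carried correction
        t = to_fp16(s + y)                # low-order bits of y may be lost here
        c = to_fp16(to_fp16(t - s) - y)   # recover what was lost
        s = t
    return s

if __name__ == "__main__":
    ones = [1.0] * 4096
    print(naive_sum_fp16(ones))              # 2048.0: the sum stalls
    print(kahan_blocked_sum_fp16(ones, 1))   # 4096.0
    print(kahan_blocked_sum_fp16(ones, 16))  # 4096.0
```

Each compensated step costs four additions or subtractions instead of one, plus a register for `c`, so folding n plain operations into every compensated step is what keeps the overhead down. Whether that beats the periodic FP32 dump would have to be measured on real hardware.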

Are there no cvt.f32.f16.hi and cvt.f32.f16.lo operations defined to avoid the manual masking? If not, extraction might be fastest using the PERM instruction?

What do we actually know about fp16 on the sm_53 X1? The X1 white paper just says there’s FMA, add, and mul, but no other details than that. A min/max would be nice for finding ranges (and also for swapping in a sort).

Often forgotten: current texture sampling hardware supports fp16 natively (since Fermi) giving no-cost interpolation and expansion to fp32. This was previously only really relevant to texture mapping.

In this old forum thread I provided example code that shows how to use FP16 texture storage (without interpolation, but that works the same for all data formats):