What Maddy is proposing may be possible. Conversions can be done in place (and dual-issued) like so:
F2F.F32.F16 f32_lo, f16;
F2F.F32.F16 f32_hi, f16.H1;
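For reference, here is a host-side Python model (function names are mine) of what those two conversions do: F2F.F32.F16 widens the low 16-bit lane of a 32-bit register holding two packed FP16 values, and the .H1 modifier selects the high lane instead. The `struct` module's `'e'` format is IEEE 754 half precision.

```python
import struct

def f16_bits_to_f32(bits: int) -> float:
    # Decode 16 raw bits as an IEEE 754 half and widen to float.
    return struct.unpack('<e', bits.to_bytes(2, 'little'))[0]

def f2f_f32_f16(reg32: int, hi: bool = False) -> float:
    # Model of F2F.F32.F16: pick the low (default) or high (.H1)
    # 16-bit lane of a 32-bit register and convert it to FP32.
    lane = (reg32 >> 16) & 0xFFFF if hi else reg32 & 0xFFFF
    return f16_bits_to_f32(lane)

# Pack 1.0 (0x3C00) in the low half and -2.0 (0xC000) in the high half.
reg = (0xC000 << 16) | 0x3C00
print(f2f_f32_f16(reg))           # low lane  -> 1.0
print(f2f_f32_f16(reg, hi=True))  # high lane -> -2.0
```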
Register count might be tricky, but some compromise can likely be made to keep the compute-to-memory-op ratio close to optimal.
Kahan-style accumulation is probably also worth looking at.
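For the record, compensated (Kahan) accumulation carries a running correction term for the low-order bits lost in each add; a minimal sketch:

```python
def kahan_sum(xs):
    s = 0.0
    c = 0.0  # compensation: low-order bits lost in previous adds
    for x in xs:
        y = x - c        # apply the correction to the incoming term
        t = s + y
        c = (t - s) - y  # algebraically zero; captures the rounding error
        s = t
    return s

# Ten tiny terms that naive left-to-right summation drops entirely
# (1e-16 is below half an ulp of 1.0 in double precision):
xs = [1.0] + [1e-16] * 10
print(sum(xs))        # 1.0 -- the small terms vanish
print(kahan_sum(xs))  # close to 1.000000000000001
```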
Depending on the depth of matrices you need to compute, fp16 accumulation might be close enough in some cases.
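One caveat with pure-FP16 accumulation is that it stops making progress once the running sum gets large relative to the addends. A quick model, rounding every add back to FP16 via `struct`'s IEEE half format; the 2048 cutoff follows from FP16's 11-bit significand:

```python
import struct

def f16(x: float) -> float:
    # Round x to the nearest FP16 value ('e' = IEEE 754 half).
    return struct.unpack('<e', struct.pack('<e', x))[0]

def f16_accumulate(xs):
    s = 0.0
    for x in xs:
        s = f16(s + x)  # the result of every add is rounded back to FP16
    return s

# Integers are exact in FP16 only up to 2048; past that, adding 1.0
# is a tie that round-to-nearest-even resolves back down.
print(f16_accumulate([1.0] * 3000))  # 2048.0, not 3000.0
```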
Hoping this comes out soon and supports these instructions so I can play around with them:
FP16 is very useful as a storage format in image processing applications, as it halves bandwidth compared with FP32. It is sufficient for things like a wavelet transform with a couple of levels (meaning that the difference between the transform using FP16 and FP32 storage is insignificant). Note that all calculations are done in FP32; only the storage format is FP16. We will show at GTC 2015 (session S5152, GPU-Accelerated Undecimated Wavelet Transform for Film and Video Denoising) how it is successfully used within a novel high-quality denoising algorithm. I suppose that FP16 (storage) is sufficient for other transforms like the DCT as well.
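As a rough illustration of the storage-only idea (a toy decimated Haar step, not the GTC algorithm itself): the arithmetic below runs in Python's native double precision and only the stored coefficients are rounded to FP16, so the reconstruction error stays at FP16's ~2^-11 relative level.

```python
import struct

def f16(x: float) -> float:
    # Round x to the nearest FP16 value (storage format only).
    return struct.unpack('<e', struct.pack('<e', x))[0]

def haar_level(sig):
    # One Haar analysis level: averages and details computed in full
    # precision, then rounded to FP16 for storage.
    avg = [f16((a + b) / 2) for a, b in zip(sig[::2], sig[1::2])]
    det = [f16((a - b) / 2) for a, b in zip(sig[::2], sig[1::2])]
    return avg, det

def inv_haar_level(avg, det):
    # Exact full-precision reconstruction from the stored coefficients.
    out = []
    for s, d in zip(avg, det):
        out += [s + d, s - d]
    return out

sig = [0.1, 0.5, 0.25, -0.3]
rec = inv_haar_level(*haar_level(sig))
# max |rec - sig| is bounded by FP16 rounding of the coefficients,
# which is negligible next to 8- or 10-bit image quantization.
```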
The FP16 format also seems useful not only as a storage format but when the calculations themselves are done in FP16 (instead of FP32), as in the new Tegra X1, though one might have to take more care. This has already been mentioned for deep learning, and it applies to other convolution-like workloads in image processing as well. See e.g. the paper “16-bit floating point operations for low-end and high-end embedded processors” at https://www.lri.fr/~lacas/Publications/ODES05.pdf
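On the "take more care" point: besides the precision loss, FP16 compute has a much smaller dynamic range, so intermediate products that are harmless in FP32 can overflow. A trivial guard as a sketch, with the limit hard-coded from the IEEE 754 half-precision format:

```python
F16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_f16(x: float) -> bool:
    """True if x is within FP16's finite range (precision loss aside)."""
    return abs(x) <= F16_MAX

# Squaring a modest pixel-scale intermediate already overflows FP16:
print(fits_f16(300.0))          # True
print(fits_f16(300.0 * 300.0))  # False: 90000 > 65504
```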