Still unclear on 16-bit float atomic operations for consumer Pascal GPUs

Is there any indication whether there will be 16-bit atomic operations (preferably an atomicAdd()) on either the ‘half’ type as a floating-point value or on a 16-bit integer? This would be for shared or global memory (hopefully both, but I would be happy with either).

I made my own 16-bit unsigned int atomicAdd() hack for shared memory, which I am currently using for real-time image reconstruction, but it is not as efficient as a 32-bit atomic operation.
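A common way to build this kind of workaround (a sketch only, not necessarily the implementation described above) is to run a compare-and-swap loop on the aligned 32-bit word that contains the 16-bit value. The names `atomicAddU16`, `countKernel`, and `demoAtomicAddU16` are hypothetical and not part of CUDA. The sketch assumes a little-endian layout and that the containing 32-bit word is valid memory.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not a CUDA API): atomically adds `val` to the 16-bit
// unsigned integer at `addr` by CAS-ing the aligned 32-bit word that holds it.
// Returns the previous 16-bit value, like the built-in atomicAdd overloads.
__device__ unsigned short atomicAddU16(unsigned short* addr, unsigned short val)
{
    // Round the address down to the containing 4-byte word.
    size_t a = reinterpret_cast<size_t>(addr);
    unsigned int* word = reinterpret_cast<unsigned int*>(a & ~size_t(3));
    // Assumption: little-endian, so byte offset 2 maps to the upper 16 bits.
    unsigned int shift = (a & 2) ? 16u : 0u;

    unsigned int old = *word, assumed;
    do {
        assumed = old;
        // Add within the selected halfword only; the sum wraps modulo 2^16.
        unsigned int sum = ((assumed >> shift) + val) & 0xFFFFu;
        unsigned int updated = (assumed & ~(0xFFFFu << shift)) | (sum << shift);
        // Retry if another thread changed either halfword in the meantime.
        old = atomicCAS(word, assumed, updated);
    } while (old != assumed);

    return static_cast<unsigned short>((old >> shift) & 0xFFFFu);
}

// Each thread increments one of two adjacent 16-bit shared-memory counters
// that live in the same 32-bit word.
__global__ void countKernel(unsigned short* out)
{
    __shared__ __align__(4) unsigned short counters[2];
    if (threadIdx.x < 2) counters[threadIdx.x] = 0;
    __syncthreads();
    atomicAddU16(&counters[threadIdx.x & 1], 1);
    __syncthreads();
    if (threadIdx.x < 2) out[threadIdx.x] = counters[threadIdx.x];
}

// Host-side driver: launches one block of 256 threads and copies back
// the two counters. Returns false if a CUDA call fails.
bool demoAtomicAddU16(unsigned short result[2])
{
    unsigned short* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(unsigned short)) != cudaSuccess) return false;
    countKernel<<<1, 256>>>(d_out);
    cudaError_t err = cudaMemcpy(result, d_out, 2 * sizeof(unsigned short),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

With 256 threads split evenly by `threadIdx.x & 1`, each counter should end at 128. Because both halfwords share one word, threads updating the neighboring counter also force retries, which is one reason this approach tends to be slower than a native 32-bit atomic.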

Even if it is not supported in hardware, I would guess that an NVIDIA version of a 16-bit atomicAdd() would be better than my ‘rolled-my-own’ version.

I have heard rumors that this may be available in some inefficient form for the GTX line, but I cannot find any documentation.

Maybe the final release of CUDA 8?

These have existed since sm_52:

RED.E.ADD.F16x2.FTZ.RN
ATOM.E.ADD.F16x2.FTZ.RN

I don’t think the shared version ATOMS.ADD.F16x2.FTZ.RN exists. I have no idea why NVIDIA has chosen not to expose these in CUDA or PTX yet. I use them from time to time in my SASS programming.
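For anyone who needs the packed-half behavior from CUDA C in the meantime, a software emulation is possible with a 32-bit atomicCAS loop. This is only a rough sketch and not a wrapper around the SASS instructions above: `atomicAddHalf2CAS`, `accumulateKernel`, `readKernel`, and `demoAtomicAddHalf2` are made-up names, the addition is done in float and rounded back to half, and denormals are not flushed the way the FTZ variants would.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Hypothetical helper (not a CUDA API): adds `val` to the packed pair of
// halves at `addr` using a CAS loop on the underlying 32-bit word.
// Returns the previous packed value.
__device__ __half2 atomicAddHalf2CAS(__half2* addr, __half2 val)
{
    unsigned int* word = reinterpret_cast<unsigned int*>(addr);
    const float2 v = __half22float2(val);

    unsigned int old = *word, assumed;
    do {
        assumed = old;
        // Unpack the current pair, add in float, round back to half.
        const float2 cur = __half22float2(*reinterpret_cast<__half2*>(&assumed));
        __half2 sum = __floats2half2_rn(cur.x + v.x, cur.y + v.y);
        // Compare bit patterns, so NaN payloads cannot stall the loop.
        old = atomicCAS(word, assumed, *reinterpret_cast<unsigned int*>(&sum));
    } while (old != assumed);

    return *reinterpret_cast<__half2*>(&old);
}

// Every thread adds (1.0, 2.0) to the same packed accumulator.
__global__ void accumulateKernel(__half2* acc)
{
    atomicAddHalf2CAS(acc, __floats2half2_rn(1.0f, 2.0f));
}

// Converts the accumulator to float on the device for easy host readback.
__global__ void readKernel(const __half2* acc, float2* out)
{
    *out = __half22float2(*acc);
}

// Host-side driver: 256 threads accumulate into one zero-initialized __half2.
// Returns false if a CUDA call fails.
bool demoAtomicAddHalf2(float result[2])
{
    __half2* d_acc = nullptr;
    float2* d_out = nullptr;
    if (cudaMalloc(&d_acc, sizeof(__half2)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(float2)) != cudaSuccess) {
        cudaFree(d_acc);
        return false;
    }
    cudaMemset(d_acc, 0, sizeof(__half2));  // all-zero bits encode (+0.0, +0.0)
    accumulateKernel<<<1, 256>>>(d_acc);
    readKernel<<<1, 1>>>(d_acc, d_out);
    float2 h = make_float2(0.0f, 0.0f);
    cudaError_t err = cudaMemcpy(&h, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(d_acc);
    cudaFree(d_out);
    result[0] = h.x;
    result[1] = h.y;
    return err == cudaSuccess;
}
```

In this run the low half should reach 256 and the high half 512, since every intermediate sum is a small integer that fp16 represents exactly. Under heavy contention the retry loop will be far slower than a hardware atomic.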

Maybe submit a bug report, or perhaps the final CUDA 8 release will add support.