Performance difference between atomicAdd() on 32-bit words (float, int, unsigned int)

I've been running some tests where I have a discrete set of values which can be mapped/converted to either the float, int, or unsigned int type. Regardless of the cast, the values will be integers (0:65535).

I was trying to see whether there is any performance difference between the three type choices when applying global atomicAdd() to these values in my (admittedly limited) application.
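Roughly, the kind of test I am describing looks like the sketch below. This is simplified and not my exact code: the kernel names, the problem size, the assumption that the source values are stored as unsigned short, and the use of a single shared accumulator (which maximizes contention so the atomic dominates the runtime) are all just illustrative.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each kernel casts the same unsigned short samples to a different 32-bit
// type and accumulates them into one global word with atomicAdd().
__global__ void add_float(const unsigned short *vals, float *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(acc, (float)vals[i]);
}

__global__ void add_int(const unsigned short *vals, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(acc, (int)vals[i]);
}

__global__ void add_uint(const unsigned short *vals, unsigned int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(acc, (unsigned int)vals[i]);
}

int main(void)
{
    const int n = 1 << 22;                       // keep the int sum below 2^31
    const dim3 block(256), grid((n + block.x - 1) / block.x);

    unsigned short *d_vals;
    cudaMalloc(&d_vals, n * sizeof(unsigned short));
    cudaMemset(d_vals, 1, n * sizeof(unsigned short));  // each value = 0x0101 = 257

    float *d_f;  int *d_i;  unsigned int *d_u;
    cudaMalloc(&d_f, sizeof(float));
    cudaMalloc(&d_i, sizeof(int));
    cudaMalloc(&d_u, sizeof(unsigned int));
    cudaMemset(d_f, 0, sizeof(float));
    cudaMemset(d_i, 0, sizeof(int));
    cudaMemset(d_u, 0, sizeof(unsigned int));

    // Warm-up launch so the first timed kernel is not penalized.
    add_float<<<grid, block>>>(d_vals, d_f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    add_float<<<grid, block>>>(d_vals, d_f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("float        atomicAdd: %.3f ms\n", ms);

    cudaEventRecord(start);
    add_int<<<grid, block>>>(d_vals, d_i, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("int          atomicAdd: %.3f ms\n", ms);

    cudaEventRecord(start);
    add_uint<<<grid, block>>>(d_vals, d_u, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("unsigned int atomicAdd: %.3f ms\n", ms);

    cudaFree(d_vals); cudaFree(d_f); cudaFree(d_i); cudaFree(d_u);
    return 0;
}
```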

Overall it seems that atomicAdd() on the float type may be slightly faster than atomicAdd() on signed int.

Should there be any performance difference based on the type of 32-bit word updated during an atomicAdd()?

Maybe txbob can provide an authoritative answer. My knowledge is limited, but my understanding of atomicAdd() on global memory is that these operations are executed by dedicated ALUs shared between the float and int variants, meaning the throughput should be identical for either type. But I am not at all sure about this, and the details may differ by architecture. What architecture are you observing this on?

How much of a performance difference are you seeing? Is it possible that the differences are due to secondary effects, e.g. additional instructions needed depending on whether the data feeding into the atomicAdd() is float or int? Even if the instruction count outside the atomicAdd() is identical for either variant, the instruction mix and thus throughput may be slightly more favorable for the float version.
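As a hypothetical illustration of that point (not actual code from this thread; kernel names are mine, and the source data is assumed to be unsigned short): the value passed to atomicAdd() has to be produced in the destination type first, so the conversion path can differ slightly between variants. Compiling a pair of kernels like the ones below (e.g. nvcc -arch=sm_52 -cubin conv.cu, then cuobjdump -sass conv.cubin) lets you compare the instruction mix around the atomic in each case.

```
// Two minimal kernels for comparing generated SASS, nothing more.
__global__ void feed_float(const unsigned short *vals, float *acc)
{
    // unsigned short -> float typically needs an integer-to-float conversion
    // instruction after the 16-bit load.
    atomicAdd(acc, (float)vals[threadIdx.x]);
}

__global__ void feed_int(const unsigned short *vals, int *acc)
{
    // unsigned short -> int is typically just a zero-extending 16-bit load,
    // with no extra conversion instruction.
    atomicAdd(acc, (int)vals[threadIdx.x]);
}
```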

This was tested on a reference Maxwell Titan X.

I believe you are correct in that any observed difference may be related to the additional instructions.

Ultimately, very little difference was noticed, so I am going to keep it simple and stick with float.