Memory bank access during int to short conversion


In the naive approach to converting an image of 32-bit integer values to an image of 16-bit short values, two threads end up writing to the same bank. For devices of compute capability 2.0, is there a clever way to do the conversion that improves performance?


It doesn’t really matter, since the conversion is entirely limited by memory bandwidth.

Anyway: don’t go through shared memory. Read a uint2 per thread and write a ushort2 (or read a uint4 and write a ushort4).
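A minimal sketch of the uint4 -> ushort4 variant, in case it helps. The names convertUint4ToUshort4, convertTail and convertOnDevice are made up for illustration. The sketch assumes the buffers come straight from cudaMalloc (so they are suitably aligned for vector loads and stores) and that values which do not fit in 16 bits may simply be truncated.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: each thread loads one uint4 (16 bytes) and stores
// one ushort4 (8 bytes), straight from and to global memory.
// Assumes `in` is 16-byte aligned and `out` is 8-byte aligned.
__global__ void convertUint4ToUshort4(const uint4* in, ushort4* out, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    // Grid-stride loop, so the grid size can be capped independently of n4.
    for (; i < n4; i += stride) {
        uint4 v = in[i];
        // Plain casts keep the low 16 bits of each component (no clamping).
        out[i] = make_ushort4((unsigned short)v.x, (unsigned short)v.y,
                              (unsigned short)v.z, (unsigned short)v.w);
    }
}

// Hypothetical scalar kernel for the 0..3 elements left over when the
// element count is not a multiple of 4.
__global__ void convertTail(const unsigned int* in, unsigned short* out,
                            size_t start, size_t n)
{
    size_t i = start + threadIdx.x;
    if (i < n) out[i] = (unsigned short)in[i];
}

// Hypothetical host wrapper: d_in and d_out are device pointers to n elements.
void convertOnDevice(const unsigned int* d_in, unsigned short* d_out, size_t n)
{
    const int threads = 256;
    size_t n4 = n / 4;  // number of complete 4-element groups
    if (n4 > 0) {
        size_t needed = (n4 + threads - 1) / threads;
        // 65535 is the maximum grid x-dimension on compute capability 2.0.
        int blocks = (int)(needed < 65535 ? needed : 65535);
        convertUint4ToUshort4<<<blocks, threads>>>(
            reinterpret_cast<const uint4*>(d_in),
            reinterpret_cast<ushort4*>(d_out), n4);
    }
    if (n % 4 != 0) {
        convertTail<<<1, 4>>>(d_in, d_out, n4 * 4, n);
    }
}
```

The uint2 -> ushort2 version has the same shape with 8-byte loads and 4-byte stores. For an image allocated with a row pitch, the wrapper would have to be applied per row, or the indexing adapted, which this sketch does not do.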

Thanks. In the naive approach I was not using shared memory, but each thread was reading only one 32-bit value and writing one 16-bit value. Writing ushort2 or ushort4 improves performance. It is still lower than that of an unconverted copy (float to float, float2 to float2 or float4 to float4), but it does improve going from float->ushort to float2->ushort2 to float4->ushort4.