In the naive approach to converting an image of 32-bit integer values to an image of 16-bit short values, two threads end up writing to the same bank. For devices of compute capability 2.0, is there a clever way to do the conversion so that performance improves?
Thanks. In the naive approach I was not using shared memory; each thread read one 32-bit value and wrote one 16-bit value. Having each thread write a ushort2 or ushort4 instead improves performance. It is still lower than that of an unconverted copy (float to float, float2 to float2, or float4 to float4), but it improves steadily going from float->ushort to float2->ushort2 to float4->ushort4.
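For anyone finding this later, here is a minimal sketch of what the float4->ushort4 variant might look like. It is not my exact code: the names `to_ushort`, `convert_f4_to_us4` and `convert_on_device` are made up for illustration, the clamp-and-round conversion is an assumption about what the conversion should do, and the element count is assumed to be a multiple of 4. Each thread does one 16-byte load and one 8-byte store, rather than a 4-byte load and a 2-byte store.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical conversion rule (an assumption): clamp to [0, 65535],
// then round to the nearest integer.
__device__ inline unsigned short to_ushort(float v)
{
    v = fminf(fmaxf(v, 0.0f), 65535.0f);
    return static_cast<unsigned short>(__float2uint_rn(v));
}

// Each thread converts four pixels per iteration: one float4 load,
// one ushort4 store. The grid-stride loop covers any n4.
__global__ void convert_f4_to_us4(const float4* __restrict__ in,
                                  ushort4* __restrict__ out,
                                  size_t n4)
{
    size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n4; i += stride) {
        float4 v = in[i];
        out[i] = make_ushort4(to_ushort(v.x), to_ushort(v.y),
                              to_ushort(v.z), to_ushort(v.w));
    }
}

// Hypothetical host wrapper: copies n floats to the device, converts them,
// and copies n unsigned shorts back. n is assumed to be a multiple of 4.
cudaError_t convert_on_device(const float* host_in,
                              unsigned short* host_out,
                              size_t n)
{
    assert(n % 4 == 0);
    if (n == 0) return cudaSuccess;

    float* d_in = nullptr;
    unsigned short* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_in, n * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void**)&d_out, n * sizeof(unsigned short));
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const size_t n4 = n / 4;
        const unsigned int threads = 256;
        size_t blocks = (n4 + threads - 1) / threads;
        if (blocks > 65535) blocks = 65535;  // grid x-dimension limit on compute capability 2.0

        // cudaMalloc returns pointers aligned well enough for float4/ushort4 access.
        convert_f4_to_us4<<<(unsigned int)blocks, threads>>>(
            reinterpret_cast<const float4*>(d_in),
            reinterpret_cast<ushort4*>(d_out), n4);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host_out, d_out, n * sizeof(unsigned short),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

The float2->ushort2 version is the same idea with `float2`, `ushort2` and `make_ushort2`. If the image size is not a multiple of 4, the last few pixels would need a scalar tail loop or a padded buffer.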