I need to convert an array of shorts to an array of floats and I’m wondering what the best way of doing it would be.
Am I right in thinking that my primary aim should be fully coalesced reads and writes, and that I will therefore have to use shared memory?
Presumably I should aim for 256 threads per block, with each thread writing one successive float. Of these threads, 128 will first each have to read 2 shorts. I guess it's better to use all 32 threads of each of the first 4 warps and then sync, rather than use only the first 16 threads of each of the 8 warps and avoid the sync? Also, is there any way to avoid a 2-way bank conflict during the conversion, and does it really matter anyway?
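
To make the scheme concrete, here is a rough sketch of what I have in mind. It is untuned and only meant to illustrate the question: the names `shortsToFloats`, `convertOnDevice` and `THREADS_PER_BLOCK` are placeholders I made up, the input is assumed to come from `cudaMalloc` (so that reading it as `short2` is properly aligned), and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_BLOCK 256  // assumed block size from the question

// Hypothetical kernel: the first 128 threads (4 full warps) each load one
// short2 (two shorts) into shared memory, then all 256 threads each
// write one float.
__global__ void shortsToFloats(const short* in, float* out, int n)
{
    __shared__ short2 staged[THREADS_PER_BLOCK / 2];

    int tid  = threadIdx.x;
    int base = blockIdx.x * THREADS_PER_BLOCK;  // first element of this block

    if (tid < THREADS_PER_BLOCK / 2) {
        int pairIdx = base / 2 + tid;           // index in units of short2
        if (2 * pairIdx + 1 < n) {
            // Both shorts are in range: one 32-bit load per thread.
            staged[tid] = reinterpret_cast<const short2*>(in)[pairIdx];
        } else if (2 * pairIdx < n) {
            // Odd-length tail: only the first short of the pair exists.
            staged[tid] = make_short2(in[2 * pairIdx], 0);
        }
    }
    __syncthreads();  // the sync the question asks about

    int i = base + tid;
    if (i < n) {
        // Threads 2k and 2k+1 read the same 32-bit shared-memory word.
        short2 p = staged[tid / 2];
        out[i] = static_cast<float>((tid & 1) ? p.y : p.x);
    }
}

// Hypothetical host wrapper: copy in, launch, copy out.
std::vector<float> convertOnDevice(const std::vector<short>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    short* d_in  = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(short));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(short), cudaMemcpyHostToDevice);

    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    shortsToFloats<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `convertOnDevice` on a small host vector should give back the same values as floats. What I can't judge is whether the shared-memory staging and the `__syncthreads()` are worth it compared with simply having each thread read one short directly.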