Convert array of shorts to array of floats

I need to convert an array of shorts to an array of floats and I’m wondering what the best way of doing it would be.

Am I right in thinking that my primary aim should be fully coalesced reads and writes, and that I will therefore have to use shared memory?

Presumably I should aim for 256 threads per block, with each thread writing one successive float. Of these threads, 128 will first each have to read 2 shorts. I guess it's better to use all 32 threads of each of the first 4 warps and then sync, rather than use only the first 16 threads of each of the 8 warps and avoid the sync? Also, is there any way to avoid a 2-way bank conflict during the conversion, and does it really matter anyway?
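For concreteness, the scheme described above might look roughly like the sketch below. The names `shorts_to_floats`, `convert_shorts_on_device` and `kFloatsPerBlock` are made up for illustration, and error checking is left out. It assumes the input pointer comes from `cudaMalloc`, so it is suitably aligned for `short2` loads. The first 4 warps each load one `short2` per thread, the block syncs, and then all 256 threads write one float each.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kFloatsPerBlock = 256;  // one output float per thread

// Hypothetical kernel: `in` must be 4-byte aligned (true for cudaMalloc memory).
__global__ void shorts_to_floats(const short* in, float* out, int n)
{
    // 128 short2 values = 256 shorts staged per block.
    __shared__ short2 tile[kFloatsPerBlock / 2];

    const int t = threadIdx.x;
    const int base = blockIdx.x * kFloatsPerBlock;  // first element of this block

    // Threads 0..127 (the first 4 warps) each load two adjacent shorts.
    if (t < kFloatsPerBlock / 2) {
        const int i = base + 2 * t;  // always even, so i / 2 is exact
        if (i + 1 < n) {
            tile[t] = reinterpret_cast<const short2*>(in)[i / 2];
        } else if (i < n) {
            tile[t] = make_short2(in[i], 0);  // odd-length tail: single short
        }
    }
    __syncthreads();  // reached by every thread in the block

    // All 256 threads read one short from shared memory and write one float.
    const short* s = reinterpret_cast<const short*>(tile);
    if (base + t < n) {
        out[base + t] = static_cast<float>(s[t]);
    }
}

// Hypothetical host wrapper; error checking omitted to keep the sketch short.
std::vector<float> convert_shorts_on_device(const std::vector<short>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;  // a launch with zero blocks is invalid

    short* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(short));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(short), cudaMemcpyHostToDevice);

    const int blocks = (n + kFloatsPerBlock - 1) / kFloatsPerBlock;
    shorts_to_floats<<<blocks, kFloatsPerBlock>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `convert_shorts_on_device` with a `std::vector<short>` returns the converted floats. Whether the shared-memory staging pays off is worth measuring against a plain kernel in which each thread reads one short and writes one float, since that simpler version may be just as fast on recent hardware. The variant that skips the sync by using only 16 threads per warp relies on warp-synchronous execution, which architectures with independent thread scheduling do not guarantee without `__syncwarp()`.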

My first instinct is to have 256 threads read 256 floats. Then, each thread converts its float to a short at the same shared memory position by casting the shared array from float to short. (No conflicts.)

Then, 128 threads read two shorts into a short2, and write it out coalesced.
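Note that the steps in this reply run in the float-to-short direction, the reverse of the original question. A rough sketch of them is below, with made-up names (`floats_to_shorts`, `convert_floats_on_device`, `kShortsPerBlock`) and no error checking. One deliberate difference: each thread keeps its float in a register and writes only the converted short to shared memory, rather than aliasing a shared float array in place, so no thread can overwrite a float that another thread has not read yet.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kShortsPerBlock = 256;  // one input float per thread

// Hypothetical kernel: `out` must be 4-byte aligned (true for cudaMalloc memory).
__global__ void floats_to_shorts(const float* in, short* out, int n)
{
    // Declared as short2 so the later packed reads are correctly aligned.
    __shared__ short2 tile[kShortsPerBlock / 2];
    short* s = reinterpret_cast<short*>(tile);

    const int t = threadIdx.x;
    const int base = blockIdx.x * kShortsPerBlock;

    // 256 threads read 256 consecutive floats (coalesced) and store shorts.
    // The cast truncates toward zero; values outside the short range are
    // not handled here.
    if (base + t < n) {
        s[t] = static_cast<short>(in[base + t]);
    }
    __syncthreads();  // reached by every thread in the block

    // 128 threads each write one packed short2 (4 bytes) to global memory.
    if (t < kShortsPerBlock / 2) {
        const int i = base + 2 * t;  // always even, so i / 2 is exact
        if (i + 1 < n) {
            reinterpret_cast<short2*>(out)[i / 2] = tile[t];
        } else if (i < n) {
            out[i] = tile[t].x;  // odd-length tail: single short
        }
    }
}

// Hypothetical host wrapper; error checking omitted to keep the sketch short.
std::vector<short> convert_floats_on_device(const std::vector<float>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<short> h_out(n);
    if (n == 0) return h_out;  // a launch with zero blocks is invalid

    float* d_in = nullptr;
    short* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(short));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + kShortsPerBlock - 1) / kShortsPerBlock;
    floats_to_shorts<<<blocks, kShortsPerBlock>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(short), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In this layout two neighbouring threads write into the same 32-bit shared-memory word, which is the situation the bank-conflict question is about. Its actual cost depends on the hardware generation and is best checked with a profiler.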