But if I look at the .ptx that is produced, the compiler does not issue a single 128-bit read for ‘temp’. Instead it issues a 64-bit read for temp.x and temp.y, and a 32-bit read for temp.z.
One way to force the compiler to issue a 128-bit read is to do:
There are several ways to go about loading float3s with coalescing. One is to stage the data through shared memory (smem) as follows.
Treat the array in both global memory and shared memory as an array of floats, not float3s. When your kernel moves data from gmem to smem, each thread performs 3 reads of scalar floats. The 2nd read is (#threads/block) floats away from the first one, and the 3rd is 2*(#threads/block) floats away from the first one. So within each of the three reads, consecutive threads access consecutive floats, and each read is coalesced.
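When processing the data inside the kernel, a thread can grab its piece of data by casting the smem array to float3 type. This needs a __syncthreads() first, since the floats a thread picks up were loaded by other threads. Compute code doesn’t change from that point.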
Writing the result back to gmem uses the same approach as reading.
This may seem convoluted at first, but only the gmem access code changes; the rest stays the same. And the performance is equal to that of a coalesced transfer (all the reads/writes are coalesced, after all). Below are the uncoalesced and coalesced code samples (the second one is hardcoded for the assumption that there are 256 threads per block):
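A minimal sketch of both variants, under some assumptions that are mine rather than anything from the thread: the per-element work is just a multiply by a scalar s, the number of float3 elements is an exact multiple of 256 (so there are no bounds checks), and the names scaleUncoalesced, scaleCoalesced and countMismatches are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 256  // assumed block size; the coalesced kernel is hardcoded for it

// Uncoalesced: each thread reads and writes one whole float3 (12 bytes),
// so neighbouring threads touch addresses that are 12 bytes apart.
__global__ void scaleUncoalesced(const float3 *in, float3 *out, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 v = in[i];
    v.x *= s; v.y *= s; v.z *= s;   // placeholder compute
    out[i] = v;
}

// Coalesced: gmem and smem are treated as plain float arrays. Each thread
// moves 3 scalar floats, THREADS floats apart, so in each of the three
// transfers consecutive threads touch consecutive floats.
__global__ void scaleCoalesced(const float *in, float *out, float s)
{
    __shared__ float tile[3 * THREADS];
    const int t = threadIdx.x;
    const int base = 3 * THREADS * blockIdx.x;  // first float owned by this block

    tile[t]               = in[base + t];
    tile[t + THREADS]     = in[base + t + THREADS];
    tile[t + 2 * THREADS] = in[base + t + 2 * THREADS];
    __syncthreads();  // the floats this thread needs were loaded by other threads

    // Compute on the data as float3 by casting the smem array.
    float3 *tile3 = reinterpret_cast<float3 *>(tile);
    float3 v = tile3[t];
    v.x *= s; v.y *= s; v.z *= s;   // same placeholder compute as above
    tile3[t] = v;
    __syncthreads();  // results must be in smem before the strided write-back

    out[base + t]               = tile[t];
    out[base + t + THREADS]     = tile[t + THREADS];
    out[base + t + 2 * THREADS] = tile[t + 2 * THREADS];
}

// Hypothetical host helper: runs both kernels on numBlocks * THREADS float3s
// and returns how many output floats differ (-1 if a CUDA call failed).
int countMismatches(int numBlocks, float s)
{
    const int nFloats = 3 * THREADS * numBlocks;
    const size_t bytes = nFloats * sizeof(float);
    std::vector<float> h(nFloats), a(nFloats), b(nFloats);
    for (int i = 0; i < nFloats; ++i) h[i] = 0.5f * i;

    float *dIn = nullptr, *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);

    scaleUncoalesced<<<numBlocks, THREADS>>>(
        reinterpret_cast<const float3 *>(dIn), reinterpret_cast<float3 *>(dA), s);
    scaleCoalesced<<<numBlocks, THREADS>>>(dIn, dB, s);

    cudaMemcpy(a.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dB, bytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();
    cudaFree(dIn); cudaFree(dA); cudaFree(dB);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < nFloats; ++i)
        if (a[i] != b[i]) ++bad;
    return bad;
}
```

Both kernels take the same launch configuration, e.g. scaleCoalesced<<<numBlocks, 256>>>(d_in, d_out, s). The only difference on the host side is that the coalesced kernel is handed the buffers as float* rather than float3*. The second __syncthreads() is there because a thread writes back floats that other threads computed.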