I need to copy 3D coordinates from global to shared memory.
My first try was:
float3 *global; __shared__ float3 shared[N]; shared[tid] = global[tid];
(tid being the thread ID and N the number of threads per block)
This is not good because each thread issues three separate 32-bit reads from global memory, 12 bytes apart from its neighbour's, so the accesses are not coalesced.
If I understand the programming guide correctly, I should use 128-bit types instead.
So I did:
float4 *global; __shared__ float3 shared[N]; float4 temp; temp = global[tid]; shared[tid].x = temp.x; shared[tid].y = temp.y; shared[tid].z = temp.z;
But if I look at the .ptx that is produced, the compiler does not issue a single 128-bit read for 'temp'. Instead it issues a 64-bit read for temp.x and temp.y, and a 32-bit read for temp.z.
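To make the .ptx easier to reproduce, here is a minimal sketch of this second attempt as a complete kernel. The kernel name loadPoints, the BLOCK_SIZE constant and the out buffer are assumptions added for illustration and are not from the original code; out exists only so that the compiler cannot discard the loads as dead code.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // assumed threads per block, chosen for illustration

// Hypothetical kernel: reads padded float4 points from global memory
// and keeps only x, y, z in shared memory.
// Assumes the grid exactly covers the input array (no bounds check).
__global__ void loadPoints(const float4 *global, float3 *out)
{
    __shared__ float3 shared[BLOCK_SIZE];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float4 temp = global[gid];   // intended to be one 128-bit load
    shared[tid].x = temp.x;
    shared[tid].y = temp.y;
    shared[tid].z = temp.z;      // temp.w is never used
    __syncthreads();

    // Write the result back so the loads above are not optimized away.
    out[gid] = shared[tid];
}
```

Compiling this with nvcc -ptx and searching the output for the ld.global instructions shows which load widths the compiler actually chose.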
One way to force the compiler to issue a 128-bit read is to do:
float4 *global; __shared__ float4 shared[N]; shared[tid] = global[tid];
However, this has three drawbacks:
1/ it wastes shared memory (one unused float per point)
2/ useless writes to shared memory occur
3/ the size of float4 is not good for avoiding bank conflicts (right?)
Right now my only option seems to be to use the second approach and fix the .ptx file by hand.
Is there a better way?
Or is the .ptx going to be re-optimized later, so that a single 128-bit read is issued even though the compiler emitted a 64-bit read plus a 32-bit read?
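For comparison, here is a sketch of one alternative that avoids vector loads altogether: treat the float3 array as a flat array of floats and let consecutive threads copy consecutive 32-bit words. It is only an illustration under stated assumptions, not a confirmed answer to the question. The names loadPointsFlat, BLOCK_SIZE, numPoints and out are hypothetical, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // assumed threads per block; must match the launch

// Hypothetical kernel: copies BLOCK_SIZE unpadded float3 points into shared memory.
__global__ void loadPointsFlat(const float3 *global, float3 *out, int numPoints)
{
    __shared__ float3 shared[BLOCK_SIZE];

    int firstPoint = blockIdx.x * BLOCK_SIZE;
    int pointsHere = min(BLOCK_SIZE, numPoints - firstPoint);
    int floatsHere = 3 * pointsHere;

    // View both arrays as plain floats: thread i copies words i, i + BLOCK_SIZE, ...
    const float *src = reinterpret_cast<const float *>(global) + 3 * firstPoint;
    float *dst = reinterpret_cast<float *>(shared);
    for (int i = threadIdx.x; i < floatsHere; i += BLOCK_SIZE)
        dst[i] = src[i];         // consecutive threads touch consecutive words
    __syncthreads();

    // Write back per point, only to check the copy (this store is not coalesced).
    if (threadIdx.x < pointsHere)
        out[firstPoint + threadIdx.x] = shared[threadIdx.x];
}

int main()
{
    const int numPoints = 1000;
    const size_t bytes = numPoints * sizeof(float3);
    std::vector<float3> in(numPoints), result(numPoints);
    for (int i = 0; i < numPoints; ++i)
        in[i] = make_float3((float)i, 2.0f * i, 3.0f * i);

    float3 *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (numPoints + BLOCK_SIZE - 1) / BLOCK_SIZE;
    loadPointsFlat<<<blocks, BLOCK_SIZE>>>(dIn, dOut, numPoints);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < numPoints; ++i)
        if (result[i].x != in[i].x || result[i].y != in[i].y || result[i].z != in[i].z)
            ++mismatches;
    printf("%d mismatches\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches != 0;
}
```

The design choice here is that global memory stays unpadded and shared memory still holds float3, so drawbacks 1/ and 2/ do not arise, and both the global reads and the shared writes are unit-stride 32-bit accesses. Whether this is actually faster than the float4 version would need to be measured on the target hardware.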