How to make float2 and float4 data coalesced?

The CUDA Programming Guide says the device can read 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction.
So I tried a test with the following code:
typedef struct __align__(16)
{
    float coeff[4];
} element;

__global__ void
float4_coalesced(element* odata, element* idata)
{
    __shared__ element sdata[BLOCK_DIM];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x;
    sdata[tx] = idata[index];
    odata[index] = sdata[tx];
}
I thought it should be coalesced; however, when I test it with the CUDA Visual Profiler, the result shows the accesses are not coherent (i.e. not coalesced).
Can anyone tell me why? And how can I make it coalesced, other than with the following method:
__shared__ float s_data[BLOCK_DIM * 4];
s_data[threadIdx.x] = idata[index];
s_data[threadIdx.x + BLOCK_DIM] = idata[index + BLOCK_DIM];
s_data[threadIdx.x + 2 * BLOCK_DIM] = idata[index + 2 * BLOCK_DIM];
s_data[threadIdx.x + 3 * BLOCK_DIM] = idata[index + 3 * BLOCK_DIM];

Can you dump the PTX for this kernel? You’ve defined your element struct as a float[4] array, whereas the float4 struct is defined to be:

struct __align__(16) float4
{
  float x, y, z, w;
};

While both structs should occupy the same memory layout, they aren’t semantically the same, and it is possible that the compiler is doing something funny.
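
One way to check this would be to build the same copy kernel with the built-in float4 type and compare its PTX (for example from `nvcc -ptx`) against the PTX of the `element` version, looking at whether the global load is a single 128-bit access such as `ld.global.v4.f32` or four separate 32-bit loads. The following is only a sketch of such a test: the value of `BLOCK_DIM`, the kernel name `float4_copy`, and the host helper `run_float4_copy` are assumptions, not taken from the original post.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_DIM 128  // assumed block size; the original post does not give its value

// Same copy kernel as in the question, but using the built-in float4 type
// instead of a user-defined struct that wraps float[4].
__global__ void float4_copy(float4* odata, const float4* idata)
{
    __shared__ float4 sdata[BLOCK_DIM];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x;
    sdata[tx] = idata[index];   // each thread moves one 16-byte element
    odata[index] = sdata[tx];
}

// Hypothetical host helper: copies n float4 elements through the kernel and
// returns true if the output matches the input.
// n is assumed to be a positive multiple of BLOCK_DIM.
bool run_float4_copy(int n)
{
    std::vector<float4> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4(4.0f * i, 4.0f * i + 1.0f,
                              4.0f * i + 2.0f, 4.0f * i + 3.0f);

    float4* d_in = nullptr;
    float4* d_out = nullptr;
    size_t bytes = n * sizeof(float4);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    float4_copy<<<n / BLOCK_DIM, BLOCK_DIM>>>(d_out, d_in);

    // The blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i].x != h_in[i].x || h_out[i].y != h_in[i].y ||
            h_out[i].z != h_in[i].z || h_out[i].w != h_in[i].w)
            return false;
    return true;
}
```

Calling `run_float4_copy(1024)` from `main` and profiling that run next to the `element` version should show whether the struct definition is what makes the difference; the helper only checks that the copy is correct, not that it is coalesced.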