How to make float2 and float4 data coalesced?

The CUDA programming guide says the device can read 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction.
So I tried a test with the following code:
typedef struct __align__(16)
{
    float coeff[4];
} element;

__global__ void
float4_coalesced(element* odata, element* idata)
{
    // Stage one 16-byte element per thread through shared memory.
    __shared__ element sdata[BLOCK_DIM];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int tx = threadIdx.x;
    sdata[tx] = idata[index];
    __syncthreads();
    odata[index] = sdata[tx];
}
I thought it should be coalesced; however, when I test it with the CUDA Visual Profiler, the accesses are reported as not coherent (that is, not coalesced).
Can anyone tell me why? And how can I make it coalesced, other than with the following method:
__shared__ float s_data[BLOCK_DIM * 4];
s_data[threadIdx.x] = idata[index];
s_data[threadIdx.x + BLOCK_DIM] = idata[index + BLOCK_DIM];
s_data[threadIdx.x + 2 * BLOCK_DIM] = idata[index + 2 * BLOCK_DIM];
s_data[threadIdx.x + 3 * BLOCK_DIM] = idata[index + 3 * BLOCK_DIM];
__syncthreads();

Can you dump the PTX for this kernel? You’ve defined your element struct as a float[4] array, whereas the float4 struct is defined to be:

struct __align__(16) float4
{
  float x, y, z, w;
};

While both structs should occupy the same memory layout, they aren’t semantically the same, and it is possible that the compiler is doing something funny.
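
The layout half of that claim can be checked on the host. Below is a minimal sketch in plain C++, under these assumptions: `alignas(16)` stands in for `__align__(16)`, and `Element` and `Float4` are hypothetical stand-in types, not the ones from the CUDA headers. The helper `same_layout()` is also made up for illustration.

```cpp
#include <cassert>   // for the assert shown in the usage note
#include <cstddef>   // offsetof
#include <cstring>   // std::memcpy

// Stand-in for the poster's struct: four floats stored as an array.
struct alignas(16) Element {
    float coeff[4];
};

// Hypothetical stand-in for CUDA's float4 (four named float members).
struct alignas(16) Float4 {
    float x, y, z, w;
};

// Compile-time checks: same size, same alignment, members at the same offsets.
static_assert(sizeof(Element) == sizeof(Float4), "sizes differ");
static_assert(alignof(Element) == alignof(Float4), "alignments differ");
static_assert(offsetof(Float4, y) == 1 * sizeof(float), "y offset");
static_assert(offsetof(Float4, z) == 2 * sizeof(float), "z offset");
static_assert(offsetof(Float4, w) == 3 * sizeof(float), "w offset");

// Runtime check: copy the raw bytes of an Element into a Float4 and
// confirm that each named member matches the corresponding array slot.
bool same_layout() {
    Element e = {{1.0f, 2.0f, 3.0f, 4.0f}};
    Float4 f;
    std::memcpy(&f, &e, sizeof f);
    return f.x == e.coeff[0] && f.y == e.coeff[1] &&
           f.z == e.coeff[2] && f.w == e.coeff[3];
}
```

Calling `assert(same_layout());` from `main` should pass if the two layouts agree. Passing only shows that the bytes line up. It says nothing about which load instructions the compiler emits for each type, so the PTX is still worth inspecting.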