Problem about Coalesced Access

wot[3008*gridDim.x*(kf-1) + gridDim.x*count1 + blkid*blockDim.x + thid] = buf[thid];

I have a problem with the access above. wot is in global memory and buf is a shared-memory array of 128 floats.
3008*gridDim.x*(kf-1) is the start of the access address, and it should always be a multiple of the region size, because 3008 is 94*32.
count1 is always n*64, so it is also a multiple of the region size. blockDim.x = 64.
So it looks like the access to the global memory wot is coalesced,
but cudaprof reports uncoalesced accesses.
Can anyone help me by pointing out what is wrong with this access?

That looks like it should coalesce to me… you might want to check that thid and count1 are being set correctly. Also, just to clarify: buf and wot are both float4's, right? Otherwise I don't think things will coalesce.
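One way to check thid and count1 without the profiler is to redo the index arithmetic on the host. The sketch below is plain C++, not the original kernel. It assumes the index expression is 3008*gridDim.x*(kf-1) + gridDim.x*count1 + blkid*blockDim.x + thid, that the elements are 4-byte floats, that wot itself is aligned as cudaMalloc returns it, and that the relevant rule is the strict one from early compute capabilities: the first thread of each 16-thread half-warp must land on a 16-element (64-byte) boundary. The function names wot_index and half_warps_aligned are made up for this example.

```cpp
#include <cassert>

// Element index written by one thread (expression reconstructed from the question).
long wot_index(long gridDimX, long blockDimX, long kf,
               long count1, long blkid, long thid) {
    return 3008 * gridDimX * (kf - 1) + gridDimX * count1
           + blkid * blockDimX + thid;
}

// Returns true if, for every block, the first thread of every half-warp
// starts on a 16-element boundary (assumed rule for 4-byte elements).
bool half_warps_aligned(long gridDimX, long blockDimX, long kf, long count1) {
    assert(gridDimX > 0 && blockDimX > 0);
    const long kHalfWarp = 16;  // threads per half-warp
    for (long blkid = 0; blkid < gridDimX; ++blkid) {
        for (long thid = 0; thid < blockDimX; thid += kHalfWarp) {
            long idx = wot_index(gridDimX, blockDimX, kf, count1, blkid, thid);
            if (idx % kHalfWarp != 0) {
                return false;  // this half-warp starts off a segment boundary
            }
        }
    }
    return true;
}
```

Calling half_warps_aligned with the real gridDim.x, blockDim.x = 64, and each kf and count1 value the kernel actually sees should return true every time if the arithmetic is as described. A false result points at a count1 (or gridDim.x*count1) that is not a multiple of 16. The check says nothing about thid itself, so it is still worth confirming that thid is threadIdx.x and not some permuted value.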