wot[3008*gridDim.x*(kf-1) + gridDim.x*count1 + blkid*blockDim.x + thid] = buf[thid];
I have a problem with the access above. wot is in global memory and buf is a shared-memory array of 128 floats.
3008*gridDim.x*(kf-1) is the start address of the access, and it should always be a multiple of the region size, because 3008 = 94*32.
count1 is always n*64, so it is also a multiple of the region size. blockDim.x = 64.
So it looks to me like the access to the global memory array wot is coalesced,
but cudaprof reports uncoalesced accesses.
Can anyone help me by pointing out what is wrong with this access?
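
The alignment reasoning above can be checked on the host with a small sketch like the one below. It only illustrates the index arithmetic and is not taken from the actual kernel. It assumes the coalescing rules of compute capability 1.0/1.1 (each half-warp of 16 threads must start on a 16-float, i.e. 64-byte, boundary), that wot itself is at least 64-byte aligned, and that kf starts at 1. The function names wot_index and half_warps_aligned are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element index written by thread `thid`,
// mirroring the index expression from the post. Assumes kf >= 1.
inline std::size_t wot_index(std::size_t grid_dim_x, std::size_t kf,
                             std::size_t count1, std::size_t blkid,
                             std::size_t block_dim_x, std::size_t thid) {
    assert(kf >= 1);  // kf - 1 would wrap around for an unsigned kf == 0
    return 3008 * grid_dim_x * (kf - 1) + grid_dim_x * count1 +
           blkid * block_dim_x + thid;
}

// Hypothetical check: returns true when the first thread of every
// half-warp (16 threads) in the block writes to an index that is a
// multiple of 16 floats (64 bytes). This is an assumption about the
// device: it matches the compute capability 1.0/1.1 rule only.
inline bool half_warps_aligned(std::size_t grid_dim_x, std::size_t kf,
                               std::size_t count1, std::size_t blkid,
                               std::size_t block_dim_x) {
    for (std::size_t t = 0; t < block_dim_x; t += 16) {
        if (wot_index(grid_dim_x, kf, count1, blkid, block_dim_x, t) % 16 != 0) {
            return false;
        }
    }
    return true;
}
```

Calling half_warps_aligned with the launch values actually in use (gridDim.x, kf, count1, blkid, and blockDim.x = 64) shows whether the index arithmetic alone can break alignment. If it returns true for every combination, the cause is probably elsewhere, for example in the alignment of the wot pointer itself or in how the counter is interpreted.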