Hi, I am seeing very strange behavior when writing to global memory inside a kernel.

Below is what my kernel looks like.

__global__ void kernel(float *d_im)
{
    float m1 = 0, m2 = 0, t1 = 0, t2 = 0, t3 = 0;

    // do something to calculate the correlation of a sub-area

    m1 = t3 / sqrt(t1 * t2);
    // m1 = 1234567897.0f;
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;

    d_im[tidj + tidi * blockDim.x * gridDim.x] = m1;
}

t1 and t2 are the variances of two sub-arrays that I calculate within the kernel, and t3 is the covariance between the two sub-arrays.

1, If I calculate the correlation as m1 = t3/sqrt(t1*t2), my kernel takes 50 ms, which is 6 times faster than the CPU version.

2, If I just assign a constant to m1, e.g. m1 = 12345454646.f, then my kernel only takes 4 ms, which is 80 times faster than the CPU version.

3, If I comment out the write to d_im, the kernel also takes 4 ms.

Based on 2 and 3, I believe my global memory writes are coalesced.
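To make the indexing concrete, here is a small host-side sketch of the same arithmetic the kernel uses for its write address. It only checks that threads adjacent in x map to adjacent elements of d_im, which is what coalescing needs. The Dim2 struct and the function names are made up for illustration (they stand in for CUDA's built-in threadIdx, blockIdx, blockDim and gridDim), and any block and grid sizes passed in are assumptions, not my real launch configuration.

```cpp
// Hypothetical stand-in for the .x/.y parts of CUDA's built-in dim3 variables.
struct Dim2 { int x, y; };

// Same arithmetic as in the kernel: flat row-major index into d_im.
int flat_index(Dim2 thread, Dim2 block, Dim2 blockDim, Dim2 gridDim) {
    int tidj = thread.x + block.x * blockDim.x;   // global column
    int tidi = thread.y + block.y * blockDim.y;   // global row
    return tidj + tidi * blockDim.x * gridDim.x;  // row * image width + column
}

// True if every pair of x-neighbouring threads in one row of a block
// writes to neighbouring elements (index difference of exactly 1).
bool row_is_contiguous(int threadY, Dim2 block, Dim2 blockDim, Dim2 gridDim) {
    for (int tx = 0; tx + 1 < blockDim.x; ++tx) {
        int a = flat_index({tx, threadY}, block, blockDim, gridDim);
        int b = flat_index({tx + 1, threadY}, block, blockDim, gridDim);
        if (b - a != 1) return false;
    }
    return true;
}
```

For example, calling row_is_contiguous(3, {1, 2}, {16, 16}, {4, 4}) checks one row of one 16x16 block in an assumed 4x4 grid. This only says something about the address pattern, not about where the time in the kernel actually goes.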

But when I write my computed value to global memory, the kernel becomes about 10 times slower. It seems as if my computed t1, t2, t3 have some magic.

Can somebody help me to figure out the problem?