Hi, I am seeing very strange behavior when writing to global memory inside a kernel.
Below is what my kernel looks like.
__global__ void kernel(float *d_im)
{
    float m1 = 0, m2 = 0, t1 = 0, t2 = 0, t3 = 0;
    // do something to calculate the correlation of a sub area
    m1 = t3 / sqrt(t1 * t2);
    // m1 = 1234567897.0f;
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;
    d_im[tidj + tidi * blockDim.x * gridDim.x] = m1;
}
t1 and t2 are the variances of two sub arrays that I calculate within the kernel, and t3 is the covariance between the two sub arrays.
1, If I calculate the correlation as m1 = t3/sqrt(t1*t2), my kernel takes 50 ms, which is 6 times faster than the CPU version.
2, If I just assign a constant to m1, such as m1=12345454646.f, my kernel takes only 4 ms, which is 80 times faster than the CPU version.
3, If I comment out the write to d_im, the kernel also takes 4 ms.
Based on 2 and 3, I believed my global memory writes are coalesced.
But when I write my computed value to global memory, the kernel becomes roughly 10 times slower (50 ms versus 4 ms). It seems as if my computed t1, t2, t3 have some magic.
Can somebody help me figure out the problem?
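To make the three cases easier to reproduce, here is a minimal sketch of how they could be timed side by side with CUDA events. It is not my real kernel: corr_kernel, time_variant, linear_index, the inputs a and b, and the window length win are all hypothetical stand-ins, and the loop is only a placeholder for the real sub-area calculation. One thing I have not ruled out is that the compiler removes the calculation when its result is never stored (cases 2 and 3), in which case those timings would mostly measure an empty kernel rather than the write pattern.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Row-major index over the whole grid, same formula as in the kernel above.
__host__ __device__ inline int linear_index(int col, int row, int width)
{
    return col + row * width;
}

// Hypothetical stand-in kernel. MODE selects the variant at compile time:
// 0 = compute and store, 1 = store a constant, 2 = compute but do not store.
template <int MODE>
__global__ void corr_kernel(const float *a, const float *b, float *d_im, int win)
{
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;
    int idx  = linear_index(tidj, tidi, blockDim.x * gridDim.x);

    // Placeholder for the real sub-area sums (assumption, not the real code).
    float t1 = 0, t2 = 0, t3 = 0;
    for (int k = 0; k < win; ++k) {
        float x = a[idx + k], y = b[idx + k];
        t1 += x * x; t2 += y * y; t3 += x * y;
    }
    float m1 = t3 / sqrtf(t1 * t2);
    if (MODE == 1) m1 = 1234.0f;      // case 2: the sums above become unused
    if (MODE != 2) d_im[idx] = m1;    // case 3: no store, so nothing is used
}

// Times one launch of the chosen variant (after a warm-up launch), in ms.
template <int MODE>
float time_variant(const float *a, const float *b, float *out, int win,
                   dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    corr_kernel<MODE><<<grid, block>>>(a, b, out, win);   // warm-up
    cudaEventRecord(start);
    corr_kernel<MODE><<<grid, block>>>(a, b, out, win);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const dim3 block(16, 16), grid(64, 64);   // 1024 x 1024 threads in total
    const int n = 1024 * 1024, win = 64;      // sizes chosen only for illustration

    std::vector<float> h(n + win, 1.0f);      // padded so idx + k stays in range
    float *a, *b, *out;
    cudaMalloc(&a, (n + win) * sizeof(float));
    cudaMalloc(&b, (n + win) * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(a, h.data(), (n + win) * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, h.data(), (n + win) * sizeof(float), cudaMemcpyHostToDevice);

    printf("compute + store : %.3f ms\n", time_variant<0>(a, b, out, win, grid, block));
    printf("constant + store: %.3f ms\n", time_variant<1>(a, b, out, win, grid, block));
    printf("compute, no store: %.3f ms\n", time_variant<2>(a, b, out, win, grid, block));

    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

If the first variant is much slower than the other two here as well, the difference is probably the cost of the calculation itself rather than anything special about the values of t1, t2 and t3. Looking at the generated code for each variant (for example with cuobjdump -sass) should show whether the loop is still present.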