Strangely slow writes to global memory

Hi, I am seeing very strange behavior when writing to global memory inside a kernel.

Below is what my kernel looks like.

__global__ void kernel(float *d_im)
{
    float m1 = 0, m2 = 0, t1 = 0, t2 = 0, t3 = 0;

    // do something to calculate the correlation of a sub-area

    m1 = t3 / sqrt(t1 * t2);
    // m1 = 1234567897.0f;
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;
    d_im[tidj + tidi * blockDim.x * gridDim.x] = m1;
}

t1 and t2 are the variances of two sub-arrays that I calculate within the kernel; t3 is the covariance between the two sub-arrays.
1. If I calculate the correlation as m1 = t3/sqrt(t1*t2), my kernel takes 50 ms, which is 6 times faster than the CPU version.
2. If I just assign a constant, m1 = 12345454646.f, my kernel takes only 4 ms, which is 80 times faster than the CPU version.
3. If I comment out the write to d_im, the kernel also takes 4 ms.

Based on 2 and 3, I believe my global memory writes are coalesced.
But when I write my computed number to global memory, the kernel becomes about 10 times slower. There seems to be some magic in my computed t1, t2, and t3.

Can somebody help me to figure out the problem?

The magic is that the compiler aggressively removes unused code. So if no result based on the calculation is ever written to global memory (your cases 2 and 3), the compiler removes the entire calculation including all memory transfers.
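A minimal sketch of how this could be checked, under stated assumptions: correlation() is a hypothetical stand-in for the real per-thread work, and the kernel names, grid size, and loop length are made up for illustration. The third kernel uses a common timing trick: the store is guarded by a comparison against the computed value, so the compiler cannot drop the calculation, yet the write almost never happens. If its time is close to the first kernel's, the cost is in the calculation, not in the store.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-thread correlation work in the question.
__device__ float correlation(int index, int n)
{
    float t1 = 0.0f, t2 = 0.0f, t3 = 0.0f;
    for (int k = 0; k < n; ++k) {
        float a = sinf(0.001f * (index + k));
        float b = cosf(0.002f * (index + k));
        t1 += a * a;   // variance-like term
        t2 += b * b;   // variance-like term
        t3 += a * b;   // covariance-like term
    }
    return t3 / sqrtf(t1 * t2);
}

// Same indexing as the kernel in the question.
__device__ int pixelIndex()
{
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;
    return tidj + tidi * blockDim.x * gridDim.x;
}

// Case 1: the result is stored, so the calculation has to run.
__global__ void storeResult(float *d_im, int n)
{
    int index = pixelIndex();
    d_im[index] = correlation(index, n);
}

// Case 2: only a constant is stored; the compiler is free to drop correlation().
__global__ void storeConstant(float *d_im, int n)
{
    int index = pixelIndex();
    float m1 = correlation(index, n);
    m1 = 12345.0f;
    d_im[index] = m1;
}

// Timing variant: the store depends on the result, so the calculation must
// stay, but the condition is practically never true and nothing is written.
__global__ void storeIfSentinel(float *d_im, int n, float sentinel)
{
    int index = pixelIndex();
    float m1 = correlation(index, n);
    if (m1 == sentinel) d_im[index] = m1;
}

int main()
{
    const dim3 block(16, 16), grid(64, 64);   // assumed 1024 x 1024 image
    const int n = 256;                        // assumed amount of work per thread
    const size_t count = (size_t)block.x * grid.x * block.y * grid.y;

    float *d_im = nullptr;
    cudaMalloc(&d_im, count * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    storeResult<<<grid, block>>>(d_im, n);    // warm-up launch, not timed
    cudaDeviceSynchronize();

    const char *names[3] = {"storeResult", "storeConstant", "storeIfSentinel"};
    for (int v = 0; v < 3; ++v) {
        cudaEventRecord(start);
        if (v == 0) storeResult<<<grid, block>>>(d_im, n);
        if (v == 1) storeConstant<<<grid, block>>>(d_im, n);
        if (v == 2) storeIfSentinel<<<grid, block>>>(d_im, n, 12345.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%-16s %.3f ms\n", names[v], ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_im);
    return 0;
}
```

The exact timings depend on the GPU and compiler version, so no numbers are given here. The pattern to look for is storeConstant being much faster than the other two.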

Thanks for your reply. However, I don't fully understand.

Suppose I do the following:

m1 = t3/sqrt(t1*t2);
m1 = 12345.0f;
d_im[index] = m1;

Here I assign m1 twice (once with the correlation, once with a constant).
What happens on the compiler side? Do you mean the compiler is smart enough to tell that the first result is never used, so it does not even compute t1, t2, and t3?
Is this what you mean?
Thanks

Baoyun

One more thing:
If I set m1 = 12345.0f;

then I can check that the memory transfer is done correctly. In the host data, every element is 12345.

Thanks

Yes.
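To make that concrete, here is a small sketch, assuming a hypothetical kernel named overwritten that takes t1, t2, and t3 as arguments instead of computing them. The first assignment to m1 is a dead store: nothing reads it before it is overwritten, so the compiler may remove it together with everything that only fed into it. The write of the constant is still performed, which is why the host sees 12345 everywhere.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: t1, t2, t3 are passed in here only for illustration.
__global__ void overwritten(float *d_im, float t1, float t2, float t3)
{
    int tidj = threadIdx.x + blockIdx.x * blockDim.x;
    int tidi = threadIdx.y + blockIdx.y * blockDim.y;
    int index = tidj + tidi * blockDim.x * gridDim.x;

    float m1 = t3 / sqrtf(t1 * t2);  // dead: overwritten before any use
    m1 = 12345.0f;                   // the only value that reaches the store
    d_im[index] = m1;
}

int main()
{
    const dim3 block(16, 16), grid(4, 4);
    const size_t count = (size_t)block.x * grid.x * block.y * grid.y;

    float *d_im = nullptr;
    cudaMalloc(&d_im, count * sizeof(float));
    overwritten<<<grid, block>>>(d_im, 2.0f, 3.0f, 1.0f);

    // Copy back and count elements that differ from the constant.
    std::vector<float> h_im(count);
    cudaMemcpy(h_im.data(), d_im, count * sizeof(float), cudaMemcpyDeviceToHost);

    size_t mismatches = 0;
    for (float v : h_im)
        if (v != 12345.0f) ++mismatches;
    printf("mismatches: %zu\n", mismatches);

    cudaFree(d_im);
    return 0;
}
```

Compiling with nvcc -ptx and reading the generated PTX is one way to see whether the division and square root survived optimization.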

Good. What is your question though?