GPU result for a[0] is one loop iteration short compared with CPU

Hi,

I wrote some simple programs to compare GPU calculations against the CPU.

When I tried this in the kernel:

[codebox]if (idx < N) a[idx] = 2 + a[idx];
for (int i = 0; i < (N * (N/8)); i++)
{
	if (idx < N) a[idx] = a[idx] * (1 + (((float)i)/N));
}
[/codebox]

I got the correct results for every a[idx].

But when I tried this instead:

[codebox]if (idx < N) a[idx] = 1 + a[idx];
for (int i = 0; i < (N * (N/8)); i++)
{
	if (idx < N)
		//a[idx] = a[idx] + a[idx] * ((float)i/N);
		a[idx] = a[idx] + a[idx] * (1 + ((float)i) * (1e-4));
}
[/codebox]

the first element, a[0], always comes out as if the loop had run one iteration fewer than on the CPU. All the other elements, from a[1] to the last one, are correct.

Does anyone know what’s wrong with my code?

I did the same calculation on CPU too.
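
For anyone who wants to reproduce the comparison, here is a minimal sketch of a CPU versus GPU check built around the second loop. It is not the original program: the kernel name `scale_gpu`, the host function `scale_cpu`, the 1D index calculation, the single-block launch and the value N = 16 are all assumptions made for illustration. N is kept small on purpose, because each iteration at least doubles a[idx] and a float would overflow after roughly 128 iterations.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel wrapping the second loop from the post.
__global__ void scale_gpu(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // assumed 1D indexing
    if (idx < N) a[idx] = 1 + a[idx];
    for (int i = 0; i < (N * (N / 8)); i++)
    {
        if (idx < N)
            a[idx] = a[idx] + a[idx] * (1 + ((float)i) * (1e-4));
    }
}

// The same arithmetic on the host, one element at a time.
void scale_cpu(float *a, int N)
{
    for (int idx = 0; idx < N; idx++)
    {
        a[idx] = 1 + a[idx];
        for (int i = 0; i < (N * (N / 8)); i++)
            a[idx] = a[idx] + a[idx] * (1 + ((float)i) * (1e-4));
    }
}

int main()
{
    const int N = 16;  // 32 loop iterations, small enough that floats stay finite
    float h_cpu[N], h_gpu[N];
    for (int i = 0; i < N; i++) h_cpu[i] = h_gpu[i] = (float)i;

    // Copy the input to the device, run the kernel, copy the result back.
    float *d_a = NULL;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemcpy(d_a, h_gpu, N * sizeof(float), cudaMemcpyHostToDevice);
    scale_gpu<<<1, N>>>(d_a, N);
    cudaMemcpy(h_gpu, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    scale_cpu(h_cpu, N);

    // Print both results side by side with their relative difference.
    for (int i = 0; i < N; i++)
    {
        float rel = fabsf(h_gpu[i] - h_cpu[i]) / fabsf(h_cpu[i]);
        printf("a[%2d]  cpu=%g  gpu=%g  rel.diff=%g\n", i, h_cpu[i], h_gpu[i], rel);
    }
    return 0;
}
```

Tiny relative differences between the two columns are normal for floating point. A result that is off by a factor of about 2 in a single element would be consistent with one missing iteration, since every pass through this loop multiplies a[idx] by slightly more than 2.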

Thanks,

Yuping

It seems the first element always becomes 0 when copied from host to device. Why is this happening?
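
One way to narrow this down is a round trip with no kernel in between: copy the array to the device, copy it straight back, and compare. If a[0] survives, the copy itself is fine and the zero has to come from the kernel or from how the host array is filled. The sketch below is only an illustration; the `CHECK` macro, the variable names and N = 8 are hypothetical and not taken from the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stop with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                    \
            return 1;                                            \
        }                                                        \
    } while (0)

int main()
{
    const int N = 8;  // hypothetical size
    float h_src[N], h_back[N];
    for (int i = 0; i < N; i++) h_src[i] = 10.0f + i;  // a[0] is deliberately nonzero

    float *d_a = NULL;
    CHECK(cudaMalloc((void **)&d_a, N * sizeof(float)));
    CHECK(cudaMemcpy(d_a, h_src, N * sizeof(float), cudaMemcpyHostToDevice));
    // No kernel launch here: the data should come back unchanged.
    CHECK(cudaMemcpy(h_back, d_a, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_a));

    // Report every element that changed during the round trip.
    int mismatches = 0;
    for (int i = 0; i < N; i++)
    {
        if (h_back[i] != h_src[i])
        {
            printf("a[%d]: sent %g, got back %g\n", i, h_src[i], h_back[i]);
            mismatches++;
        }
    }
    printf("%d mismatches\n", mismatches);
    return mismatches != 0;
}
```

Checking the return value of every runtime call matters here, because a failed cudaMalloc or cudaMemcpy otherwise goes unnoticed and the host buffer simply keeps whatever it held before.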