simple cdot implementation beginner question

__global__ void
testKernel(float* g_idata, float* g_odata, int block_size_x,
           int block_size_y, int grid_size_x, int grid_size_y, const int N)
{
    int index = blockIdx.x * block_size_x + threadIdx.x;
    if (index < N) {
        g_odata[0] += g_idata[index] * g_idata[index];
    }
}

Why is this not working as expected?

Thanks a lot.

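Because every single thread on the device is trying to increment the same memory location simultaneously. Read-modify-write is not an atomic operation: each thread reads g_odata[0], adds its own product, and writes the result back, so threads that read the same old value overwrite each other's updates and most of the contributions are lost. See the scalarProd project in the SDK for a complete description of how to implement this operation, or search the forums for the numerous other threads that discuss the read-modify-write issue in more detail.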
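For anyone who wants something concrete to start from, here is a minimal sketch of one common way around the problem. It is not the scalarProd code from the SDK, only an illustration of the same general idea: each block sums its products in shared memory and writes a single partial result, and the host adds up the partials. The names sumSquaresKernel, sumSquares, g_partial and BLOCK_SIZE are made up for this example, the block size of 256 is an assumption, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed value; must be a power of two for the loop below

// Each block reduces its slice of the input in shared memory and writes one
// partial sum, so no two threads ever update the same location concurrently.
__global__ void sumSquaresKernel(const float* g_idata, float* g_partial, int N)
{
    __shared__ float cache[BLOCK_SIZE];

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the input contribute zero.
    cache[threadIdx.x] = (index < N) ? g_idata[index] * g_idata[index] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Only thread 0 of each block writes, and each block has its own slot.
    if (threadIdx.x == 0) {
        g_partial[blockIdx.x] = cache[0];
    }
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and adds the per-block partial sums on the CPU.
// Error checking is omitted for brevity.
float sumSquares(const float* h_in, int N)
{
    if (N <= 0) return 0.0f;

    int blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float* d_in = 0;
    float* d_partial = 0;
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresKernel<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, N);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (int i = 0; i < blocks; ++i) {
        total += h_partial[i];
    }
    return total;
}
```

Calling sumSquares(h_data, N) on a host array returns the sum of the squared elements. Summing the partials on the host keeps the sketch free of atomics; for very large N the partial sums could be reduced by a second kernel launch instead.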

Ahh, thanks a lot … I was wondering why it worked without the increment.

I had the same problem a few weeks ago, too, and this thread (especially the last answer) was very helpful for me:

http://forums.nvidia.com/index.php?showtopic=42642