Kernel gives a wrong result when transforming a matrix into an array

Hello there.

I have had this problem for a while, so I am presenting it here in the hope that you can help me.

  • error is a matrix on the host that holds some values

  • cud_err is the same matrix on the device (I can assure you that it contains the right values)

After I execute a kernel of my own, the matrix cud_err holds some values that I want to sum.
If I copy those values back to the host (into the error matrix) and do a for(i) for(j) sum += error[i][j], I get the expected result.
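
The host-side check described above can be sketched as follows. It is only a minimal sketch: the name hostSum and the uint64_t element type are assumptions, since the declaration of error is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference sum over a 2D host matrix (assumed: h rows of w uint64_t values).
uint64_t hostSum(const std::vector<std::vector<uint64_t>>& error) {
    uint64_t sum = 0;
    for (const auto& row : error) {
        // Sanity check: every row is expected to have the same width.
        assert(row.size() == error.front().size());
        for (uint64_t v : row)
            sum += v;
    }
    return sum;
}
```

The value returned by a function like this is what finalResult is expected to match.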

Instead, I want to transform that matrix into an array on the device, which is more convenient for my purposes.

I use another array that contains as many elements as the number of blocks I use in the transformation kernel (each element of the array contains the sum of the cud_err matrix's elements from one block).

For this I use the code:

// initialize the host array
for(int i=0 ; i<resultArrayDim ; i++)
    result[i] = 0;

// copy the host values of the array into the device one
cudaMemcpy(cud_result, result, resultArrayDim * sizeof(uint64_t), cudaMemcpyHostToDevice);

// execute the transforming kernel
addKernel<<<grid, threads, 0>>>(cud_err, cud_result, h, w); // grid = 1 and threads = 256 as example

// copy back to the host the values of the device array
cudaMemcpy(result, cud_result, resultArrayDim * sizeof(uint64_t), cudaMemcpyDeviceToHost);

// compute the sum of all elements
uint64_t finalResult = 0;
for(int i=0 ; i<resultArrayDim ; i++)
    finalResult += result[i];
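
To make the intended result concrete, here is a minimal CPU sketch of what each element of result is meant to hold: one partial sum per block-sized tile of the matrix, followed by the final sum. The names tilePartialSums, totalOf, tileH and tileW are hypothetical, the uint64_t element type is an assumption, and the order in which tiles map to result entries (tile row by tile row) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference: one partial sum per tile ("block") of a row-major h x w matrix.
// tileH and tileW stand in for the block dimensions (hypothetical names).
std::vector<uint64_t> tilePartialSums(const std::vector<uint64_t>& flat,
                                      unsigned h, unsigned w,
                                      unsigned tileH, unsigned tileW) {
    assert(flat.size() == static_cast<std::size_t>(h) * w);
    assert(tileH > 0 && tileW > 0);
    const unsigned tilesY = (h + tileH - 1) / tileH;  // tiles along the rows
    const unsigned tilesX = (w + tileW - 1) / tileW;  // tiles along the columns
    std::vector<uint64_t> result(static_cast<std::size_t>(tilesY) * tilesX, 0);
    for (unsigned row = 0; row < h; ++row) {
        for (unsigned col = 0; col < w; ++col) {
            // Index of the tile that contains element (row, col).
            const std::size_t tile =
                static_cast<std::size_t>(row / tileH) * tilesX + col / tileW;
            result[tile] += flat[static_cast<std::size_t>(row) * w + col];
        }
    }
    return result;
}

// Final step: add up the partial sums, as the host loop over result[] does.
uint64_t totalOf(const std::vector<uint64_t>& result) {
    uint64_t total = 0;
    for (uint64_t v : result)
        total += v;
    return total;
}
```

Whatever the tile size, totalOf(tilePartialSums(...)) should equal the plain sum over the whole matrix, which gives a simple check for the values copied back from cud_result.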

I forgot to say that on the device the error matrix is stored as a flat array: element (tidx, tidy) is cud_err_array[tidx * w + tidy], as in most image-processing examples.
The addKernel looks like this:

const unsigned int tidx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int tidy = blockIdx.y * blockDim.y + threadIdx.y;
// tidx goes from 0 to the height of a picture, let's say, and tidy from 0 to the width of the same picture
if(tidx < h && tidy < w)
{
    unsigned int i;
    uint64_t colsSum = 0;
    if(threadIdx.x == 0) // computation is done only for the elements on the first line of each block
    {
        i = tidx * w + tidy;

        // I compute the sum of all elements of the matrix on columns for each block
        for(int j=0 ; j<blockDim.x ; j++)
            if(i + j * w < h * w)
                colsSum += cud_err_array[i + j * w];
        cud_result[tidy] += colsSum;
    }
}
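
The flat layout used by the kernel (element (tidx, tidy) stored at tidx * w + tidy) can be sketched on the host as below. This is a minimal sketch under assumptions: flatIndex and flatten are hypothetical helper names, and the uint64_t element type is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major index of element (row, col) in a matrix with w columns.
inline std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t w) {
    assert(col < w);
    return row * w + col;
}

// Copy a 2D host matrix into one contiguous array, row after row.
std::vector<uint64_t> flatten(const std::vector<std::vector<uint64_t>>& m,
                              std::size_t w) {
    std::vector<uint64_t> flat(m.size() * w);
    for (std::size_t row = 0; row < m.size(); ++row) {
        assert(m[row].size() == w);
        for (std::size_t col = 0; col < w; ++col)
            flat[flatIndex(row, col, w)] = m[row][col];
    }
    return flat;
}
```

With this layout, moving down one row in the same column means adding w to the index, which is what the i + j * w expression in the kernel relies on.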


As you have probably guessed, my problem is that finalResult differs from the sum I get when I compute it on the host from the error matrix.

If you have any questions, please ask.

Thanks a lot for any help!!

P.S.: Sorry, I forgot to describe my system, although I am aware that it probably does not matter much. I use CUDA 2.0 and I have a GeForce 8600M GT device.
I did not mention this earlier because the problem also appears in emulation mode.