Hello there.
I have had this problem for a while, so I'm presenting it to you now; maybe you can help me.
- error is a host matrix that holds some values
- cud_err is the corresponding matrix on the device (which I can assure you contains the right values)
Well, after I execute a kernel of my own, the cud_err matrix holds some values that I want to sum. If I copy those values back to the host (into the error matrix) and do a for(i) for(j) sum += error[i][j], I get the expected result.
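Concretely, that host-side reference computation looks roughly like this (a simplified sketch; the element type of error is an assumption on my part here, and h and w are the matrix height and width):

// host-side reference (sketch): copy cud_err back into error first,
// then sum everything on the CPU
uint64_t sum = 0;
for (int i = 0; i < h; i++)
    for (int j = 0; j < w; j++)
        sum += error[i][j];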
Instead, I want to reduce that matrix into an array, which is more convenient for me. I use another array with as many elements as the number of blocks used in the reduction kernel (each element of the array holds the sum of the cud_err matrix's elements covered by one block).
For this I use the following code:
// initialize the host array
for (int i = 0; i < resultArrayDim; i++)
{
    result[i] = 0;
}

// copy the host values of the array into the device one
cudaMemcpy(cud_result, result, resultArrayDim * sizeof(uint64_t), cudaMemcpyHostToDevice);

// execute the reduction kernel
addKernel<<<grid, threads, 0>>>(cud_err, cud_result, h, w); // grid = 1 and threads = 256, as an example

// copy the values of the device array back to the host
cudaMemcpy(result, cud_result, resultArrayDim * sizeof(uint64_t), cudaMemcpyDeviceToHost);

// compute the sum of all elements
uint64_t finalResult = 0;
for (int i = 0; i < resultArrayDim; i++)
    finalResult += result[i];
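By the way, I could probably skip the host-side init loop entirely; as far as I know cudaMemset fills bytes, so writing zeros over the whole buffer should be equivalent:

// zero the device array directly instead of initializing on the host and copying
cudaMemset(cud_result, 0, resultArrayDim * sizeof(uint64_t));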
One thing I forgot to say: on the device, the error matrix is stored flattened as a 1D array in row-major order (cud_err_array[tidx * w + tidy]; you know what I mean, as in most picture examples).
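Just to make the mapping explicit, here is a tiny sketch (r and c are hypothetical row/column indices, not variables from my code):

// row-major flattening: element (r, c) of an h-by-w matrix
// maps to the 1D index r * w + c
int idx = r * w + c; // so cud_err_array[idx] holds error[r][c]

With that in mind, the addKernel looks like this: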
// (signature reconstructed from the launch above; the element type of
// cud_err_array is assumed to be uint64_t here)
__global__ void addKernel(uint64_t* cud_err_array, uint64_t* cud_result, int h, int w)
{
    const unsigned int tidx = blockIdx.x * blockDim.x + threadIdx.x; // tidx goes from 0 to the height of the picture
    const unsigned int tidy = blockIdx.y * blockDim.y + threadIdx.y; // tidy goes from 0 to the width of the picture

    if (tidx < h && tidy < w)
    {
        uint64_t colsSum = 0;
        if (threadIdx.x == 0) // the computation is done only by the threads on the first row of each block
        {
            unsigned int i = tidx * w + tidy;
            // walk down column tidy and accumulate this block's blockDim.x elements
            for (int j = 0; j < blockDim.x; j++)
            {
                if (i + j * w < h * w)
                {
                    colsSum += cud_err_array[i + j * w];
                }
            }
            cud_result[tidy] += colsSum;
        }
    }
}
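For completeness, since the kernel uses both threadIdx.x and threadIdx.y, the launch configuration has to be two-dimensional; a minimal sketch of what such a launch could look like (the 16x16 block shape here is just an illustration, not necessarily what I actually use):

// hypothetical 2D launch matching the tidx/tidy indexing above;
// x covers the height, y covers the width
dim3 threads(16, 16);
dim3 grid((h + threads.x - 1) / threads.x,
          (w + threads.y - 1) / threads.y);
addKernel<<<grid, threads>>>(cud_err, cud_result, h, w);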
As you have probably figured out by now, something is wrong: finalResult comes out different from the sum I compute on the host directly from the error matrix.
If you have any questions, please ask.
Thanks a lot for any help!!
P.S.: Sorry, I forgot to say anything about my system, even though I'm aware it doesn't matter much here. I use CUDA 2.0 and a GeForce 8600M GT. I didn't mention this earlier because the problem also appears in emulation mode.