Reduction Problem

Hi,

I’m currently trying to integrate the reduction sample into one of my projects and got stuck. I use the simplest version from NVIDIA’s parallel_reduction.pdf, and it worked when I used it to reduce an integer array, but now I need it to reduce an array of char arrays.

I reduced the code to a simple example which uses only one block with 4 threads. The input is an array of 4 character arrays with 9 chars each (instead of a 2D array I am using a 1D array with 4*9 = 36 elements). Every active thread should add 2 rows of the array element by element (so result[0] = row1[0] + row2[0], result[1] = row1[1] + row2[1], and so on). I hope it becomes clear when you see the example below.
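To make the layout concrete, here is a minimal host-side sketch of the flat indexing and of the row pairing a 4-thread tree reduction would perform. The names `flat_index`, `add_rows_reference` and `reduce_four_rows_reference` are hypothetical helpers made up for illustration, and carries between elements are ignored here:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: position of element `col` of row `row` in a flat
// 1D array that stores `size` elements per row.
__host__ __device__ inline int flat_index(int row, int col, int size)
{
    return row * size + col;
}

// Hypothetical host-side reference: adds row `src` onto row `dst`,
// element by element, without any carry between elements.
void add_rows_reference(unsigned char* data, int dst, int src, int size)
{
    for (int col = 0; col < size; col++)
        data[flat_index(dst, col, size)] += data[flat_index(src, col, size)];
}

// Same pairing as a 4-thread tree reduction:
// step 1: row 0 += row 1 and row 2 += row 3, step 2: row 0 += row 2.
void reduce_four_rows_reference(unsigned char* data, int size)
{
    add_rows_reference(data, 0, 1, size);
    add_rows_reference(data, 2, 3, size);
    add_rows_reference(data, 0, 2, size);
}
```

Running such a reference on the host gives something to compare the kernel output against.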

I have checked the array indexes many times, and they seem to be correct. Thread No. 0 adds its two rows correctly, but thread No. 2 seems to do nothing and I don’t know why. I would be happy if someone could tell me what I am missing.

Kernel Invocation:

add_results<<<1, 4>>>(device_results, device_results, 9);

add_results:

[codebox]__global__ void add_results(unsigned char* results, unsigned char* in_results, int size)
{
	// Dynamically sized shared memory: one row of `size` bytes per thread
	extern __shared__ unsigned char s_results[];

	int tid = threadIdx.x;
	int i = (blockIdx.x * blockDim.x + tid) * size;

	// Each thread copies its own row from global into shared memory
	for (int j = 0; j < size; j++)
		s_results[tid*size+j] = in_results[i + j];
	__syncthreads();

	// Interleaved reduction: in each step, thread tid adds row tid+j into row tid
	for (int j = 1; j < blockDim.x; j *= 2)
	{
		if (tid % (2*j) == 0)
		{
			add_bin(&s_results[tid*size], &s_results[tid*size], &s_results[tid*size + j*size], size);
		}
		__syncthreads();
	}

	// Thread 0 writes the block's reduced row back to global memory
	if (tid == 0)
	{
		for (int j = 0; j < size; j++)
		{
			results[blockIdx.x*size + j] = s_results[j];
		}
	}
}[/codebox]

“device_results” before Kernel Invocation:

[codebox]
0_0_1_1_1_1_0_0_0_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_1_1_1_0_
0_0_0_0_0_1_1_1_1_[/codebox]

“device_results” after the kernel invocation (as you can see, thread No. 2, which should modify the third row, didn’t seem to do anything):

[codebox]0_0_1_2_3_3_2_1_0_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_1_1_1_0_
0_0_0_0_0_1_1_1_1_

expected:
0_0_1_2_3_4_3_2_1_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_2_2_2_1_
0_0_0_0_0_1_1_1_1_
[/codebox]

[codebox]__device__ void add_bin(unsigned char* result, unsigned char* op1, unsigned char* op2, int size)
{
	int tmp;
	char carry = 0;

	// Add byte-wise from the last element to the first, propagating the carry
	for (int i = size - 1; i >= 0; i--)
	{
		tmp = op1[i] + op2[i] + carry;
		if ( tmp < 256 )	// fits into one byte (0..255), no carry
		{
			result[i] = tmp;
			carry = 0;
		}
		else
		{
			result[i] = tmp - 256;
			carry = 1;
		}
	}
}[/codebox]

Thank you!

I found the solution to my problem:
It wasn’t that thread No. 2 didn’t work; it just had no input to work with. I forgot that I have to specify the size of the shared memory a kernel can use when I invoke it, because s_results is declared as extern __shared__ and gets its size from the third launch parameter, which defaults to 0. Otherwise it seems that the kernel can only access 32 bytes; when I try to access byte 33, I don’t get any error, but the value is always 0.

So invoking the kernel with
add_results<<<1, 4, sizeof(char)* 9*4>>>(device_results, device_results, 9);
solved my problem.
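For anyone running into the same thing, here is a minimal, self-contained sketch of a launch that passes the dynamic shared memory size and checks for errors afterwards. The kernel `copy_rows_via_shared` and the host helper `launch_with_shared` are hypothetical stand-ins made up for illustration, not the add_results code above; the sketch assumes a single block with one thread per row:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical toy kernel: every thread copies its row into dynamically
// sized shared memory and then back out to global memory.
__global__ void copy_rows_via_shared(unsigned char* out, const unsigned char* in, int size)
{
    extern __shared__ unsigned char s_rows[];   // size is set by the launch
    int tid = threadIdx.x;
    for (int j = 0; j < size; j++)
        s_rows[tid * size + j] = in[tid * size + j];
    __syncthreads();
    for (int j = 0; j < size; j++)
        out[tid * size + j] = s_rows[tid * size + j];
}

// Hypothetical host helper: launches one block of `rows` threads with
// rows * size bytes of shared memory. Returns true if every CUDA call succeeded.
bool launch_with_shared(const unsigned char* host_in, unsigned char* host_out, int rows, int size)
{
    size_t bytes = (size_t)rows * size * sizeof(unsigned char);
    unsigned char* d_in = nullptr;
    unsigned char* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaError_t err = cudaMemcpy(d_in, host_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Third launch parameter: bytes of dynamic shared memory per block
        copy_rows_via_shared<<<1, rows, bytes>>>(d_out, d_in, size);
        err = cudaGetLastError();                      // launch errors
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();   // execution errors
    if (err == cudaSuccess) err = cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Computing the byte count once and reusing it for both the allocation and the launch keeps the two from drifting apart. An out-of-bounds shared memory access is not guaranteed to be reported by these checks, so a tool such as compute-sanitizer is probably the more reliable way to catch it.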