Hi,
I’m currently trying to use the reduction sample in one of my projects and got stuck. I use the simplest version from NVIDIA’s parallel_reduction.pdf. It worked when I used it to reduce an integer array, but now I need it to reduce an array of char arrays.
I reduced the code to a simple example which uses only one block with 4 threads. The input is an array of 4 character arrays, each with 9 chars (instead of a 2D array I am using a 1D array with 4*9 = 36 elements). Every active thread should add 2 rows of the array element by element (so result[0] = row1[0] + row2[0], result[1] = row1[1] + row2[1] and so on). I hope it becomes clear when you see the example below.
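A minimal sketch of the flat layout described above, assuming row-major storage with `size` bytes per row. The helper names `flat_index` and `add_rows` are hypothetical and not part of the posted code, and the plain `+=` here ignores any carry between elements.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: offset of element (row, col) in a flat, row-major
// array where every row holds `size` bytes.
__host__ __device__ inline int flat_index(int row, int col, int size)
{
    return row * size + col;
}

// Hypothetical helper: data[dst_row][c] += data[src_row][c] for every column c.
// No carry handling; each element is added independently.
__host__ __device__ inline void add_rows(unsigned char* data, int dst_row, int src_row, int size)
{
    for (int c = 0; c < size; c++)
        data[flat_index(dst_row, c, size)] += data[flat_index(src_row, c, size)];
}
```

With 4 rows of 9 bytes, row 3 starts at offset `flat_index(3, 0, 9)`, which is 27, and ends at offset 35.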
I have checked the array indexes many times and they seem to be correct. Thread 0 adds its two rows correctly, but thread 2 seems to do nothing, and I don’t know why. I would be happy if someone could tell me what I am missing.
Kernel Invocation:
add_results<<<1, 4>>>(device_results, device_results, 9);
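One thing that may be worth checking here: for a buffer declared as `extern __shared__`, the size is normally passed as the third launch parameter, in bytes per block. The sketch below only illustrates that launch form. The kernel `stage_rows` and the host helper `launch_stage_rows` are hypothetical names, and the sketch uses separate input and output buffers, which is an assumption and differs from the call above.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stages one row in dynamic shared memory
// and copies it back out, just to exercise the launch configuration.
__global__ void stage_rows(unsigned char* out, const unsigned char* in, int size)
{
    extern __shared__ unsigned char s_rows[];   // sized by the launch, not here
    int tid = threadIdx.x;
    for (int j = 0; j < size; j++)
        s_rows[tid * size + j] = in[tid * size + j];
    __syncthreads();
    for (int j = 0; j < size; j++)
        out[tid * size + j] = s_rows[tid * size + j];
}

// Hypothetical host helper: returns true if the launch and all copies succeed.
bool launch_stage_rows(const unsigned char* host_in, unsigned char* host_out, int rows, int size)
{
    size_t bytes = (size_t)rows * size * sizeof(unsigned char);
    unsigned char *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, host_in, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    stage_rows<<<1, rows, bytes>>>(d_out, d_in, size);

    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

For 4 threads and rows of 9 bytes, `bytes` is 36, so the equivalent launch configuration would be `<<<1, 4, 36>>>`.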
add_results:
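[codebox]__global__ void add_results(unsigned char* results, unsigned char* in_results, int size)
{
    // Dynamic shared memory; its size is set by the kernel launch.
    extern __shared__ unsigned char s_results[];

    int tid = threadIdx.x;
    int i = (blockIdx.x * blockDim.x + tid) * size;   // start of this thread's row in global memory

    // Each thread copies its own row into shared memory.
    for (int j = 0; j < size; j++)
        s_results[tid*size+j] = in_results[i + j];
    __syncthreads();

    // Interleaved reduction: in every step, row tid absorbs row tid + j.
    for (int j = 1; j < blockDim.x; j *= 2)
    {
        if (tid % (2*j) == 0)
        {
            add_bin(&s_results[tid*size], &s_results[tid*size], &s_results[tid*size + j*size], size);
        }
        __syncthreads();
    }

    // Thread 0 writes the first shared row back as the block's result.
    if (tid == 0)
    {
        for (int j = 0; j < size; j++)
        {
            results[blockIdx.x*size + j] = s_results[j];
        }
    }
}[/codebox]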
“device_results” before Kernel Invocation:
[codebox]
0_0_1_1_1_1_0_0_0_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_1_1_1_0_
0_0_0_0_0_1_1_1_1_[/codebox]
“device_results” after the kernel invocation (as you can see, thread 2, which should modify the third row, didn’t seem to do anything):
[codebox]0_0_1_2_3_3_2_1_0_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_1_1_1_0_
0_0_0_0_0_1_1_1_1_
expected:
0_0_1_2_3_4_3_2_1_
0_0_0_1_1_1_1_0_0_
0_0_0_0_1_2_2_2_1_
0_0_0_0_0_1_1_1_1_
[/codebox]
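[codebox]__device__ void add_bin(unsigned char* result, unsigned char* op1, unsigned char* op2, int size)
{
    int tmp;
    char carry = 0;
    // Add byte by byte, starting at the least significant (last) element.
    for (int i = size - 1; i >= 0; i--)
    {
        tmp = op1[i] + op2[i] + carry;
        if ( tmp <= 255 )   // the sum still fits into one byte
        {
            result[i] = tmp;
            carry = 0;
        }
        else                // overflow: keep the low byte, carry 1 into the next element
        {
            result[i] = tmp - 256;
            carry = 1;
        }
    }
}[/codebox]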
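To check the expected first row independently, the following sketch runs the same interleaved reduction on the four sample rows. It is not a definitive fix and rests on several assumptions: the launch supplies `rows * size` bytes of dynamic shared memory, `add_bin` carries only when a byte sum exceeds 255, and input and output use separate device buffers. `reduce_rows` and `run_sample` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Multi-byte addition, last element least significant; result may alias op1.
__device__ void add_bin(unsigned char* result, const unsigned char* op1,
                        const unsigned char* op2, int size)
{
    int carry = 0;
    for (int i = size - 1; i >= 0; i--)
    {
        int tmp = op1[i] + op2[i] + carry;
        result[i] = (unsigned char)(tmp & 0xFF);  // keep the low byte
        carry = tmp > 255 ? 1 : 0;                // assumption: carry only above 255
    }
}

// Hypothetical kernel: same reduction pattern as add_results in the post.
__global__ void reduce_rows(unsigned char* results, const unsigned char* in_results, int size)
{
    extern __shared__ unsigned char s_results[];
    int tid = threadIdx.x;
    int i = (blockIdx.x * blockDim.x + tid) * size;
    for (int j = 0; j < size; j++)
        s_results[tid * size + j] = in_results[i + j];
    __syncthreads();

    for (int j = 1; j < blockDim.x; j *= 2)
    {
        if (tid % (2 * j) == 0)
            add_bin(&s_results[tid * size], &s_results[tid * size],
                    &s_results[(tid + j) * size], size);
        __syncthreads();
    }

    // Only the first shared row is copied back to global memory.
    if (tid == 0)
        for (int j = 0; j < size; j++)
            results[blockIdx.x * size + j] = s_results[j];
}

// Hypothetical host helper: reduces the four sample rows and stores the
// block result (9 bytes) in first_row. Returns false on a CUDA error.
bool run_sample(unsigned char* first_row)
{
    const int rows = 4, size = 9;
    const unsigned char host_in[rows * size] = {
        0,0,1,1,1,1,0,0,0,
        0,0,0,1,1,1,1,0,0,
        0,0,0,0,1,1,1,1,0,
        0,0,0,0,0,1,1,1,1 };
    unsigned char *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(host_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, size) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory in bytes (4 * 9 = 36).
    reduce_rows<<<1, rows, rows * size * sizeof(unsigned char)>>>(d_out, d_in, size);

    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(first_row, d_out, size, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `run_sample(row)` with a 9-byte buffer should yield 0 0 1 2 3 4 3 2 1, the column-wise sum of all four rows. Because only the `tid == 0` branch writes to global memory, the intermediate sum that thread 2 builds in shared memory never becomes visible in the output.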
Thank you!