Hi all,
I am applying a matrix-vector multiplication, Ax = b, to very small matrices (matrix size 128 x 128). My idea was to load the whole vector x into shared memory and compute each scalar product, i.e. A(i,1:N) * x = b(i), using a tree-like reduction (similar to the sample code). Thus, every block is responsible for computing a single element of b, and every thread within a block multiplies one element of row i by the corresponding element of x.
If I use the tree-like reduction, I get error peaks that do not occur if I sum the multiplied values in a nonparallel way. I was not able to find the mistake.
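For reference, the nonparallel summation can be sketched on the host roughly as follows. This is a minimal sketch rather than my exact code: the function name `matVecSerial` is made up for illustration, and it assumes A is stored column-major, A(row, col) = A[col * n + row], which is how the kernel below indexes it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference: b[row] = sum over col of A(row, col) * x[col].
// Assumption: A is stored column-major, i.e. A(row, col) = A[col * n + row],
// matching the index computation used in the kernel.
std::vector<float> matVecSerial(const std::vector<float>& A,
                                const std::vector<float>& x)
{
    const std::size_t n = x.size();
    assert(A.size() == n * n);  // square matrix expected

    std::vector<float> b(n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        float sum = 0.0f;
        for (std::size_t col = 0; col < n; ++col)
            sum += A[col * n + row] * x[col];
        b[row] = sum;
    }
    return b;
}
```

Comparing the kernel output element by element against such a reference is how the error peaks show up. Small differences are expected, because the tree reduction adds the floats in a different order.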

Where is the bug?

Does my data-to-thread correspondence make sense?
__global__ void
smallMatVecMul_kernel(float* global_A, float* global_x, float* global_b)
{
    const int VEC_LENGTH = blockDim.x;
    int thx = threadIdx.x;
    int blx = blockIdx.x;
    // Element A(blx, thx): one row per block, one column per thread
    int index = __mul24(thx, VEC_LENGTH) + blx;
    // shared_h holds the products, shared_x the copy of x
    extern __shared__ float shared_h[];
    float* shared_x = shared_h + VEC_LENGTH;
    int local_i = 128;
    float local_sum = 0;

    shared_x[thx] = global_x[thx];
    __syncthreads();
    shared_h[thx] = global_A[index] * shared_x[thx];

    // Tree-like reduction; for the reduction, local_i must be a power of two
    for (int stride = local_i / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        if (thx < local_i)
            shared_h[thx] += shared_h[stride + thx];
    }

    if (thx == 0) {
        local_sum = shared_h[0];
        // Add any elements beyond the power-of-two part serially
        for (int j = local_i; j < VEC_LENGTH; j++)
            local_sum += shared_h[j];
        global_b[blx] = local_sum;
    }
}
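To check the reduction logic independently of the GPU, one option is to emulate a single block on the host. The sketch below is a hypothetical helper (`treeReduceEmulated` is not from the sample code) in which only the first `stride` lanes add in each step, so no lane reads an element that another lane writes in the same step. My loop uses the guard `thx < local_i` instead, which keeps every thread active in every step, so this difference may be related to the peaks; I have not confirmed that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a block-level tree reduction.
// `values` plays the role of shared_h; its size must be a power of two.
float treeReduceEmulated(std::vector<float> values)
{
    const std::size_t n = values.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two check

    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        // On the GPU, a __syncthreads() would separate these steps.
        // Only lanes below `stride` are active, so the elements that are
        // read (index >= stride) are never written during the same step.
        for (std::size_t thx = 0; thx < stride; ++thx)
            values[thx] += values[thx + stride];
    }
    return values[0];
}
```

Feeding it the 128 products of one row and comparing the result with b(i) from the kernel would show whether the two reductions agree.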
Thanks for the help.
Cem