# Small SGEMM tree-like reduction

Hi all,

I am implementing a matrix-vector multiplication, Ax = b, for very small matrices (matrix size 128 x 128). My idea is to load the whole vector x into shared memory and compute each scalar product, i.e. A(i,1:N) * x = b(i), using a tree-like reduction (similar to the sample code). Thus, every block is responsible for computing a single element of b, and every thread within a block is responsible for multiplying one pair of elements.

If I use the tree-like reduction, I get error peaks that do not occur when I sum the multiplied values serially. I have not been able to find the mistake.

• Where is the bug?

• Does my data-to-thread correspondence make sense?

```
__global__ void
smallMatVecMul_kernel( float* global_A, float* global_x, float* global_b)
{
    const int VEC_LENGTH = blockDim.x;
    int thx = threadIdx.x;   // column handled by this thread
    int blx = blockIdx.x;    // row of A, i.e. element of b, handled by this block
    int index = __mul24(thx,VEC_LENGTH) + blx;

    // Dynamic shared memory: products first, then the copy of x
    extern __shared__ float shared_h[];
    float* shared_x = shared_h + VEC_LENGTH;

    int local_i = 128;
    float local_sum = 0;

    shared_x[thx] = global_x[thx];
    shared_h[thx] = global_A[index] * shared_x[thx];

    // Tree-like reduction; for the reduction, local_i must be a power of two
    for(int stride = local_i / 2; stride > 0; stride >>= 1){
        if (thx < local_i )
            shared_h[thx] += shared_h[stride + thx];
    }

    if ( thx == 0){
        local_sum = shared_h[0];
        // Add the elements beyond the power-of-two part serially
        for (int j = local_i; j < VEC_LENGTH; j++)
            local_sum += shared_h[j];
        global_b[blx] = local_sum;
    }
}
```

thx for the help.

Cem
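A likely culprit, though nothing in the thread confirms it, is the reduction loop itself: it has no `__syncthreads()` between steps, its guard is `thx < local_i` rather than `thx < stride`, and `local_i` is fixed at 128 even when the block is smaller. The following is a minimal sketch of the same block-per-row scheme with those three points changed. It assumes column-major storage (as the `index` computation suggests) and at most 1024 threads per block; the names `matVecRow_kernel`, `matVecRow`, and `pow2` are illustrative and not from the original post.

```cuda
#include <cuda_runtime.h>

// b[row] = sum_j A(row, j) * x[j]; one block per row, one thread per column.
// Assumption: A is column-major (A(i, j) at A[j * n + i]) and blockDim.x == n.
// pow2 must be the largest power of two that is <= n.
__global__ void matVecRow_kernel(const float* A, const float* x, float* b, int pow2)
{
    extern __shared__ float prod[];   // n floats, one product per thread
    const int n   = blockDim.x;
    const int thx = threadIdx.x;
    const int row = blockIdx.x;

    prod[thx] = A[thx * n + row] * x[thx];
    __syncthreads();                  // every product is written before any is read

    // Fold the tail beyond the power-of-two part onto the front.
    if (thx + pow2 < n)
        prod[thx] += prod[thx + pow2];
    __syncthreads();

    // Tree reduction over the first pow2 entries.
    for (int stride = pow2 / 2; stride > 0; stride >>= 1) {
        if (thx < stride)             // only the lower half adds in each step
            prod[thx] += prod[thx + stride];
        __syncthreads();              // outside the if: all threads reach the barrier
    }

    if (thx == 0)
        b[row] = prod[0];
}

// Hypothetical host wrapper; allocation error checks are omitted for brevity.
cudaError_t matVecRow(const float* hA, const float* hx, float* hb, int n)
{
    int pow2 = 1;
    while (pow2 * 2 <= n) pow2 *= 2;

    float *dA = 0, *dx = 0, *db = 0;
    cudaMalloc((void**)&dA, n * n * sizeof(float));
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&db, n * sizeof(float));
    cudaMemcpy(dA, hA, n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // n blocks of n threads, n floats of dynamic shared memory per block.
    matVecRow_kernel<<<n, n, n * sizeof(float)>>>(dA, dx, db, pow2);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)           // the blocking copy also waits for the kernel
        err = cudaMemcpy(hb, db, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dx);
    cudaFree(db);
    return err;
}
```

In this sketch x is read straight from global memory, because each thread uses its element of x only once within a block, so a shared copy brings no reuse. With column-major storage, neighbouring threads in a block read elements of A that are n floats apart, which is worth keeping in mind when judging the data-to-thread mapping.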

Any chance you would be willing to post a self-contained compilable example?

Of course …

Within the archive there is a MATLAB file that generates some random matrices:

• Choose the destination folder and the matrix size (1 < m_s < 128) in compareOutput-MatVec.m.

• After the random matrices have been generated, run smallMatVec (smallMatVec.cu).

• Activate the rest of the MATLAB file, i.e. read the output, calculate the error, and print it.
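The comparison in the last step is done in MATLAB in the archive. As a rough host-side equivalent, the sketch below computes a serial reference and the largest absolute deviation from it. It assumes column-major storage; `matVecReference` and `maxAbsError` are hypothetical names, not functions from the archive.

```cuda
#include <cmath>

// Serial reference: b[i] = sum_j A(i, j) * x[j].
// Assumption: A is column-major, i.e. A(i, j) is stored at A[j * n + i].
void matVecReference(const float* A, const float* x, float* b, int n)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;             // accumulate in double to limit rounding error
        for (int j = 0; j < n; ++j)
            sum += (double)A[j * n + i] * (double)x[j];
        b[i] = (float)sum;
    }
}

// Largest absolute element-wise difference between two vectors of length n.
float maxAbsError(const float* got, const float* ref, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = fmaxf(worst, fabsf(got[i] - ref[i]));
    return worst;
}
```

A tree reduction adds the products in a different order than a serial loop, so small single-precision rounding differences between the two results are expected. Isolated large deviations in individual elements of b point to a logic error instead.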

Hope this helps.

Cem
smallMatVec.rar (1.34 MB)