Small SGEMM tree-like reduction

Hi all,

I am applying a matrix-vector multiplication, Ax = b, to very small matrices (matrix size 128 x 128). My idea was to load the whole vector x into shared memory and compute a scalar product, i.e. A(i,1:N) * x = b(i), using a tree-like reduction (similar to the sample code). Thus, every block is responsible for computing a single element of b, and every thread within a block is responsible for multiplying one element of the row with the matching element of x.

If I use the tree-like reduction, I get error peaks that do not occur if I calculate the sum of the multiplied values in a non-parallel way. I was not able to find the mistake.

  • Where is the bug?

  • Does my data-thread correspondence make sense?

__global__ void
smallMatVecMul_kernel(float* global_A, float* global_x, float* global_b)
{
    const int VEC_LENGTH = blockDim.x;
    int thx = threadIdx.x;
    int blx = blockIdx.x;
    int index = __mul24(thx, VEC_LENGTH) + blx;

    extern __shared__ float shared_h[];
    float* shared_x = shared_h + VEC_LENGTH;
    int local_i = 128;
    float local_sum = 0;

    shared_x[thx] = global_x[thx];

    shared_h[thx] = global_A[index] * shared_x[thx];

    // Tree-like reduction; for the reduction, i must be a power of two
    for (int stride = local_i / 2; stride > 0; stride >>= 1) {
        if (thx < local_i)
            shared_h[thx] += shared_h[stride + thx];
    }

    if (thx == 0) {
        local_sum = shared_h[0];
        for (int j = local_i; j < VEC_LENGTH; j++)
            local_sum += shared_h[j];
        global_b[blx] = local_sum;
    }
}

thx for the help.
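
A possible explanation, judging only from the kernel as posted: the guard `thx < local_i` lets every thread add at every stride, so threads with thx >= stride overwrite entries that lower threads are reading in the same step. A tree reduction normally restricts each step to `thx < stride` and places a `__syncthreads()` between steps. The following is a minimal sketch under those assumptions, not a verified fix for the attached project. The names `smallMatVecMul_sketch` and `smallMatVecMul_host` are hypothetical, A is assumed to be stored column-major (as MATLAB stores it), and the matrix size n is assumed to equal the block size and to fit in one block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch: one block per output element b[row], one thread per column.
// Assumptions (not from the original post): A is column-major, n == blockDim.x.
__global__ void smallMatVecMul_sketch(const float* A, const float* x, float* b)
{
    extern __shared__ float prod[];   // n floats, sized at launch
    const int n   = blockDim.x;
    const int col = threadIdx.x;
    const int row = blockIdx.x;

    // Each thread uses only its own x[col], so x is read straight from
    // global memory here instead of being staged in shared memory.
    prod[col] = A[col * n + row] * x[col];
    __syncthreads();                  // all products written before reducing

    // Largest power of two that does not exceed n.
    int pow2 = 1;
    while (pow2 * 2 <= n) pow2 *= 2;

    // Tree reduction over prod[0 .. pow2): only the lower half adds.
    for (int stride = pow2 / 2; stride > 0; stride >>= 1) {
        if (col < stride)
            prod[col] += prod[col + stride];
        __syncthreads();              // finish this level before the next
    }

    // Thread 0 adds the tail [pow2, n) serially and stores the result.
    if (col == 0) {
        float sum = prod[0];
        for (int j = pow2; j < n; ++j)
            sum += prod[j];
        b[row] = sum;
    }
}

// Hypothetical host helper; error checking is omitted for brevity.
std::vector<float> smallMatVecMul_host(const std::vector<float>& A,
                                       const std::vector<float>& x)
{
    const int n = static_cast<int>(x.size());
    float *dA, *dx, *db;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // n blocks of n threads, n floats of dynamic shared memory per block.
    smallMatVecMul_sketch<<<n, n, n * sizeof(float)>>>(dA, dx, db);

    std::vector<float> b(n);
    cudaMemcpy(b.data(), db, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(db);
    return b;
}
```

Calling `smallMatVecMul_host(A, x)` with a size that is not a power of two (for example 5 x 5) also exercises the serial tail loop. On the data-thread question: with this indexing, neighbouring threads read A at addresses n floats apart, so the loads are not coalesced; storing A row-major would make neighbouring threads read neighbouring addresses.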


Any chance you would be willing to post a self-contained compilable example?

of course …

Within the archive there is a MATLAB file, compareOutput-MatVec.m, which generates some random matrices:

  • choose the destination folder and the matrix size (1 < m_s < 128) in compareOutput-MatVec.m,

  • after the random matrices have been generated, run smallMatVec,

  • activate the rest of the MATLAB file, i.e. read the output, calculate the error, and print it.

hope this helps.

smallMatVec.rar (1.34 MB)