# Matrix by Vector multiplication

Hey, I am trying to implement fast matrix by vector multiplication:

y = Ax

where:

y - output vector of size m

A - input matrix of size m x n

x - input vector of size n

I don’t really understand the CUDA code on the last 2 pages, so I focused on the pseudocode on page 6. It’s quite clear, so I tried to implement it in CUDA on my own. The result is this kernel (for now, for simplicity, let’s just assume that m, the row count of the matrix, is a multiple of 64, and n, the column count, is a multiple of 8):

``````
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of the matrix
// 64 * 8 threads per block, i.e. blockDim = (64, 8)
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];			// 8 partial sums for each of the 64 rows

	int i = threadIdx.x;				// row within this block's 64-row tile
	int j = threadIdx.y;				// which of the row's 8 sub-sums
	int h = blockIdx.x;

	P[i][j] = 0.0f;

	int h_64 = (h<<6);					// h_64 = h*64

	for(int w = 0; w < (n>>3); ++w) {	// n>>3 = n/8
		int w_8 = (w<<3);				// w_8 = w*8
		P[i][j] += A[ (h_64 + i)*n + (w_8 + j) ] * x[ w_8 + j ]; // calculate 8 sub-sums for every row /// !!!!!!!!
	}

	__syncthreads();					// threads with the same i sit in different warps

	if( 0 == j ) {						// do it only once per row
		float sum = 0.0f;				// init sum to 0
		for(int k = 0; k < 8; ++k) {
			sum += P[i][k];				// sum the 8 sub-sums
		}
		y[ h_64 + i ] = sum;			// save the result
	}
}
``````

This kernel works fine, but I want to know how I can speed this algorithm up. The code in the paper I linked above looks insane to me.

For now my only idea is to do something with the line marked “!!!”. It looks like the x vector is read 64 times from global memory by each thread block…

But I don’t know what to do about it, since one thread block needs the whole x vector to calculate its sums, and the x vector may be larger than the shared memory limit.

Maybe someone has an idea?
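In case it helps: one standard trick (just a sketch of the usual approach, not the paper’s method; `TILE` and `mv_kernel_tiled` are my own names) is to stage x through shared memory in fixed-size tiles. The block then reads each element of x from global memory once instead of 64 times, and only the tile size, not n, has to fit in shared memory. Keeping the 64 x 8 block layout from above:

``````
// Sketch: same 64x8 block layout, but x is staged through shared
// memory in tiles of TILE elements, so each block reads every
// element of x once instead of 64 times.
#define TILE 512	// 512 floats = 2 KB of shared memory per block

__global__ void mv_kernel_tiled(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];	// partial sums, as before
	__shared__ float xs[TILE];	// current tile of x

	int i   = threadIdx.x;		// 0..63, row within the block's tile
	int j   = threadIdx.y;		// 0..7, which sub-sum
	int tid = j * 64 + i;		// 0..511, flat id for cooperative loads
	int row = (blockIdx.x << 6) + i;

	P[i][j] = 0.0f;

	for(int base = 0; base < n; base += TILE) {
		// all 512 threads load one element of the tile each (coalesced)
		if( base + tid < n ) xs[tid] = x[base + tid];
		__syncthreads();

		// each thread accumulates its 8-column-strided sub-sum over the tile
		for(int c = j; c < TILE && base + c < n; c += 8)
			P[i][j] += A[row * n + base + c] * xs[c];
		__syncthreads();	// don't overwrite xs while other warps still read it
	}

	if( 0 == j ) {			// once per row: combine the 8 sub-sums
		float sum = 0.0f;
		for(int k = 0; k < 8; ++k) sum += P[i][k];
		y[row] = sum;
	}
}
``````

Each of the block’s 512 threads loads exactly one element per tile, so the loads stay coalesced; the second `__syncthreads()` stops a fast warp from refilling xs while slower warps are still reading the previous tile.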

Hmm, strange thing: I modified my kernel to let more threads calculate the final sum, but the kernel is in fact always slower now…

``````
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of the matrix
// 64 * 8 threads per block, i.e. blockDim = (64, 8)
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];			// 8 partial sums for each of the 64 rows

	int i = threadIdx.x;				// row within this block's 64-row tile
	int j = threadIdx.y;				// which of the row's 8 sub-sums
	int h = blockIdx.x;

	P[i][j] = 0.0f;

	int h_64 = (h<<6);					// h_64 = h*64

	for(int w = 0; w < (n>>3); ++w) {	// n>>3 = n/8
		int w_8 = (w<<3);				// w_8 = w*8
		P[i][j] += A[ (h_64 + i)*n + (w_8 + j) ] * x[ w_8 + j ]; // calculate 8 sub-sums for every row /// !!!!!!!!
	}

	/// EDIT FROM HERE -> NEW CODE BELOW
	// tree reduction over the 8 sub-sums of each row
	for(int s = 1; s < 8; s<<=1) {
		__syncthreads();				// threads with the same i sit in different warps
		int index = 2*s*j;
		if( index < 8 ) {
			P[i][index] += P[i][index+s];
		}
	}
	__syncthreads();

	if( 0 == j ) {
		y[ h_64 + i ] = P[i][0];		// the reduced sum ends up in column 0
	}
}
``````

Does anyone know why?

Hmm, very strange thing: my algorithm works fine for any m and n = 64 / 128 / 256.

But for n greater than 256 (512 / 1024 / 2048) the result is incorrect, though only slightly:

for example, the result should be 44608184 and it is 44608256. It’s strange, because my matrix and vector contain only integers, so I don’t think it’s a floating-point accuracy problem. So what is it?

And it happens with both versions of my kernel…

What’s going on? For n = 64 / 128 / 256 the result is 100% correct.

For n = 512 the difference between the result and the correct result is 72
For n = 1024 the difference is 256
For n = 2048 the difference is 2816
For n = 4096 the difference is 14336

The error starts to appear at n = 8*47 and greater… maybe it’s some kind of overflow…?

Or maybe it’s something to do with alignment??? I allocate matrix A as `cudaMalloc(&A, m*n*sizeof(float))`.

So the error keeps growing. How is that possible?

Here is the full source code, commented ->

mv.cu (5.12 KB)

-> can someone tell me why the result is:

``````
m = 64, n varies

for the m = 64 and n = 64 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 128 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 256 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 368 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 376 the error( y - desired_output ) = 4.000000
for the m = 64 and n = 384 the error( y - desired_output ) = 8.000000
for the m = 64 and n = 512 the error( y - desired_output ) = 72.000000
for the m = 64 and n = 1024 the error( y - desired_output ) = 256.000000
for the m = 64 and n = 2048 the error( y - desired_output ) = -2816.000000
for the m = 64 and n = 4096 the error( y - desired_output ) = -14336.000000

m = 1024, n varies

for the m = 1024 and n = 64 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 128 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 256 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 368 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 376 the error( y - desired_output ) = 4.000000
for the m = 1024 and n = 384 the error( y - desired_output ) = 8.000000
for the m = 1024 and n = 512 the error( y - desired_output ) = 72.000000
for the m = 1024 and n = 1024 the error( y - desired_output ) = 256.000000
for the m = 1024 and n = 4096 the error( y - desired_output ) = -14336.000000
``````

How is it possible?


Ok I figured out which piece of code acts like it shouldn’t. Look at those 2 kernels:

``````
__global__ void kernel_1(float* output)
{
	int n = 8 * 48; // = 384
	float sum = 0.0f;

	for(int j = 0; j < 8; ++j)		   // j = 0...7
	{
		for(int w = 0; w < (n/8); ++w) // w = 0...47
		{
			int s = ((w*8)+j);	 // s = 0...383
			sum += (float)(s*s); // sum += s^2
		}
	}
	*output = sum;
}
``````
``````
__global__ void kernel_2(float* output)
{
	int j = threadIdx.y; // j = 0...7
	int n = 8 * 48;		 // = 384
	float sum = 0.0f;

	__shared__ float sums[8];
	sums[j] = 0.0f;

	for(int w = 0; w < (n/8); ++w) // w = 0..47
	{
		int s = ((w*8)+j);		 // s = 0...383
		sums[j] += (float)(s*s); // sums[j] += s^2
	}

	__syncthreads(); // make every thread's sub-sum visible

	if( 0 == j ) // do it only once per block
	{
		for(int k = 0; k < 8; ++k)
		{
			sum += sums[k];
		}
		*output = sum;
	}
}
``````

the first one is executed with a single thread:

``````
dim3 blocks(1,1);
dim3 threads(1,1);	// one thread
kernel_1<<< blocks, threads >>>(output);
``````

the second one with 8 threads:

``````
dim3 blocks(1,1);
dim3 threads(1,8);	// threadIdx.y = 0...7
kernel_2<<< blocks, threads >>>(output);
``````

These kernels should do the same thing - I just got rid of one for loop from the first kernel and turned it into 8 threads… That loop can be done in parallel and doesn’t need to be synchronized…

but the outputs of those kernels are different:

kernel1 = 18800684

kernel2 = 18800704

How can that be???


Okay, maybe I am stupid, but -> WHY?

``````
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
	int n = 48*8;
	float sum = 0.0f;

	for(int w = 0; w < (n/8); ++w) {
		for(int j = 0; j < 8; ++j) {
			int s = ((8*w)+j);
			sum += (float)(s*s);
		}
	}
	printf("%f\n",sum);

	sum = 0.0f;
	for(int j = 0; j < 8; ++j) {
		for(int w = 0; w < (n/8); ++w) {
			int s = ((8*w)+j);
			sum += (float)(s*s);
		}
	}
	printf("%f\n",sum);

	system("PAUSE");
	return 0;
}
``````

the result is:

``````
18800696.000000
18800684.000000
``````

I just changed the order of the for loops!

I think it’s the end of the world…

it must be some overflow… -> what’s the highest float value?

no, it can’t be overflow… the max float is huge, like billions of trillions…

and 384*384*384 is just about 56 million… WTF?

even if I change int to unsigned int the result is the same

OK, I think it’s solved - when I use double instead of float it works -> but why doesn’t float work?
