# Matrix by Vector multiplication

Hey, I am trying to implement fast matrix by vector multiplication:

y = Ax

where:

y - output vector of size m

A - input matrix of size m x n

x - input vector of size n

I don’t really understand the CUDA code on the last 2 pages, so I focused on the pseudocode on page 6. It’s quite clear, so I tried to implement it in CUDA on my own. The result is this kernel (for now, for simplicity, let’s just assume that m, the row count of the matrix, is a multiple of 64, and n, the column count, is a multiple of 8):

``````
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of the matrix
// 64 * 8 threads per block, i.e. blockDim = (64, 8)
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];			// 8 partial sums for each of the 64 rows

	int i = threadIdx.x;				// row within this block's 64-row tile
	int j = threadIdx.y;				// which of the row's 8 sub-sums
	int h = blockIdx.x;

	P[i][j] = 0.0f;

	int h_64 = (h<<6);					// h_64 = h*64

	for(int w = 0; w < (n>>3); ++w) {	// n>>3 = n/8
		int w_8 = (w<<3);				// w_8 = w*8
		P[i][j] += A[ (h_64 + i)*n + (w_8 + j) ] * x[ w_8 + j ]; // calculate 8 sub-sums for every row /// !!!!!!!!
	}

	__syncthreads();					// threads with the same i sit in different warps

	if( 0 == j ) {						// do it only once per row
		float sum = 0.0f;				// init sum to 0
		for(int k = 0; k < 8; ++k) {
			sum += P[i][k];				// sum the 8 sub-sums
		}
		y[ h_64 + i ] = sum;			// save the result
	}
}
``````

This kernel works fine, but I want to know how I can speed this algorithm up. The code in the paper I linked above looks insane to me.

For now my only idea is to do something with the line marked “!!!”. It looks like the x vector is read 64 times from global memory by each thread block…

But I don’t know what to do about it, since one thread block needs the whole x vector to calculate its sums, and the x vector may be larger than the shared memory limit.

Maybe someone has an idea?
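In case it helps: one standard trick (just a sketch of the usual approach, not the paper’s method; `TILE` and `mv_kernel_tiled` are my own names) is to stage x through shared memory in fixed-size tiles. The block then reads each element of x from global memory once instead of 64 times, and only the tile size, not n, has to fit in shared memory. Keeping the 64 x 8 block layout from above:

``````
// Sketch: same 64x8 block layout, but x is staged through shared
// memory in tiles of TILE elements, so each block reads every
// element of x once instead of 64 times.
#define TILE 512	// 512 floats = 2 KB of shared memory per block

__global__ void mv_kernel_tiled(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];	// partial sums, as before
	__shared__ float xs[TILE];	// current tile of x

	int i   = threadIdx.x;		// 0..63, row within the block's tile
	int j   = threadIdx.y;		// 0..7, which sub-sum
	int tid = j * 64 + i;		// 0..511, flat id for cooperative loads
	int row = (blockIdx.x << 6) + i;

	P[i][j] = 0.0f;

	for(int base = 0; base < n; base += TILE) {
		// all 512 threads load one element of the tile each (coalesced)
		if( base + tid < n ) xs[tid] = x[base + tid];
		__syncthreads();

		// each thread accumulates its 8-column-strided sub-sum over the tile
		for(int c = j; c < TILE && base + c < n; c += 8)
			P[i][j] += A[row * n + base + c] * xs[c];
		__syncthreads();	// don't overwrite xs while other warps still read it
	}

	if( 0 == j ) {			// once per row: combine the 8 sub-sums
		float sum = 0.0f;
		for(int k = 0; k < 8; ++k) sum += P[i][k];
		y[row] = sum;
	}
}
``````

Each of the block’s 512 threads loads exactly one element per tile, so the loads stay coalesced; the second `__syncthreads()` stops a fast warp from refilling xs while slower warps are still reading the previous tile.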

Hmm, strange thing: I modified my kernel to let more threads calculate the final sum, but the kernel is in fact always slower now…

``````
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of the matrix
// 64 * 8 threads per block, i.e. blockDim = (64, 8)
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
	__shared__ float P[64][8];			// 8 partial sums for each of the 64 rows

	int i = threadIdx.x;				// row within this block's 64-row tile
	int j = threadIdx.y;				// which of the row's 8 sub-sums
	int h = blockIdx.x;

	P[i][j] = 0.0f;

	int h_64 = (h<<6);					// h_64 = h*64

	for(int w = 0; w < (n>>3); ++w) {	// n>>3 = n/8
		int w_8 = (w<<3);				// w_8 = w*8
		P[i][j] += A[ (h_64 + i)*n + (w_8 + j) ] * x[ w_8 + j ]; // calculate 8 sub-sums for every row /// !!!!!!!!
	}

	/// EDIT FROM HERE -> NEW CODE BELOW
	// tree reduction over the 8 sub-sums of each row
	for(int s = 1; s < 8; s<<=1) {
		__syncthreads();				// threads with the same i sit in different warps
		int index = 2*s*j;
		if( index < 8 ) {
			P[i][index] += P[i][index+s];
		}
	}
	__syncthreads();

	if( 0 == j ) {
		y[ h_64 + i ] = P[i][0];		// the reduced sum ends up in column 0
	}
}
``````

Does anyone know why?

Hmm, very strange thing: my algorithm works fine for any m and n = 64 / 128 / 256.

But for n greater than 256 (512 / 1024 / 2048) the result is incorrect, though only slightly:

for example, the result should be 44608184 and it is 44608256. It’s strange, because my matrix and vector contain only integers, so I don’t think it’s a floating-point accuracy problem. So what is it?

And it happens with both versions of my kernel…

What’s going on? For n = 64 / 128 / 256 the result is 100% correct.

For n = 512 the difference between the result and the correct result is 72
For n = 1024 the difference is 256
For n = 2048 the difference is 2816
For n = 4096 the difference is 14336

The error starts to appear at n = 8*47 and greater… maybe it’s some kind of overflow…?

Or maybe it’s something to do with alignment??? I allocate matrix A as `cudaMalloc(&A, m*n*sizeof(float))`.

So the error keeps growing. How is that possible?

Here is the full source code, commented ->

mv.cu (5.12 KB)

-> can someone tell me why the result is:

``````
m = 64, n varies

for the m = 64 and n = 64 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 128 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 256 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 368 the error( y - desired_output ) = 0.000000
for the m = 64 and n = 376 the error( y - desired_output ) = 4.000000
for the m = 64 and n = 384 the error( y - desired_output ) = 8.000000
for the m = 64 and n = 512 the error( y - desired_output ) = 72.000000
for the m = 64 and n = 1024 the error( y - desired_output ) = 256.000000
for the m = 64 and n = 2048 the error( y - desired_output ) = -2816.000000
for the m = 64 and n = 4096 the error( y - desired_output ) = -14336.000000

m = 1024, n varies

for the m = 1024 and n = 64 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 128 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 256 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 368 the error( y - desired_output ) = 0.000000
for the m = 1024 and n = 376 the error( y - desired_output ) = 4.000000
for the m = 1024 and n = 384 the error( y - desired_output ) = 8.000000
for the m = 1024 and n = 512 the error( y - desired_output ) = 72.000000
for the m = 1024 and n = 1024 the error( y - desired_output ) = 256.000000
for the m = 1024 and n = 4096 the error( y - desired_output ) = -14336.000000
``````

How is it possible?


Ok I figured out which piece of code acts like it shouldn’t. Look at those 2 kernels:

``````
__global__ void kernel_1(float* output)
{
	int n = 8 * 48; // = 384
	float sum = 0.0f;

	for(int j = 0; j < 8; ++j)		   // j = 0...7
	{
		for(int w = 0; w < (n/8); ++w) // w = 0...47
		{
			int s = ((w*8)+j);	 // s = 0...383
			sum += (float)(s*s); // sum += s^2
		}
	}
	*output = sum;
}
``````
``````
__global__ void kernel_2(float* output)
{
	int j = threadIdx.y; // j = 0...7
	int n = 8 * 48;		 // = 384
	float sum = 0.0f;

	__shared__ float sums[8];
	sums[j] = 0.0f;

	for(int w = 0; w < (n/8); ++w) // w = 0..47
	{
		int s = ((w*8)+j);		 // s = 0...383
		sums[j] += (float)(s*s); // sums[j] += s^2
	}

	__syncthreads(); // make every thread's sub-sum visible

	if( 0 == j ) // do it only once per block
	{
		for(int k = 0; k < 8; ++k)
		{
			sum += sums[k];
		}
		*output = sum;
	}
}
``````

the first one is executed with a single thread:

``````
dim3 blocks(1,1);
dim3 threads(1,1);	// one thread
kernel_1<<< blocks, threads >>>(output);
``````

the second one with 8 threads:

``````
dim3 blocks(1,1);
dim3 threads(1,8);	// threadIdx.y = 0...7
kernel_2<<< blocks, threads >>>(output);
``````

These kernels should do the same thing - I just got rid of one for loop from the first kernel and turned it into 8 threads… That loop can be done in parallel and doesn’t need to be synchronized…

but the outputs of those kernels are different:

kernel1 = 18800684

kernel2 = 18800704

How can that be???


Okay, maybe I am stupid, but -> WHY?

``````
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
	int n = 48*8;
	float sum = 0.0f;

	for(int w = 0; w < (n/8); ++w) {
		for(int j = 0; j < 8; ++j) {
			int s = ((8*w)+j);
			sum += (float)(s*s);
		}
	}
	printf("%f\n",sum);

	sum = 0.0f;
	for(int j = 0; j < 8; ++j) {
		for(int w = 0; w < (n/8); ++w) {
			int s = ((8*w)+j);
			sum += (float)(s*s);
		}
	}
	printf("%f\n",sum);

	system("PAUSE");
	return 0;
}
``````

the result is:

``````
18800696.000000
18800684.000000
``````

I just changed the order of the for loops!

I think it’s the end of the world…

it must be some overflow… -> what’s the highest float value?

no, it can’t be overflow… the max float is huge, like billions of trillions…

and 384*384*384 is just about 56 million… WTF?

even if I change int to unsigned int the result is the same

OK, I think it’s solved - when I use double instead of float it works -> but why doesn’t float work?
