Hey, I am trying to implement a fast matrix-by-vector multiplication:
y = Ax
where:
y - output vector of size m
A - input matrix of size m x n
x - input vector of size n
I read this article/publication - http://www.worldscinet.com/ppl/mkt/preserv…26408003545.pdf
I don't really understand the CUDA code on the last 2 pages, so I focused on the pseudocode on page 6. It's quite clear, so I tried to implement it in CUDA on my own. The result is this kernel (for now, for simplicity, I assume that m, the number of rows of the matrix, is a multiple of 64, and n, the number of columns, is a multiple of 8 - let's just assume that for now):
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of matrix
// 64 * 8 threads per block
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
    __shared__ float P[64][8];
    int i = threadIdx.x;
    int j = threadIdx.y;
    int h = blockIdx.x;
    P[i][j] = 0;
    int h_64 = (h << 6);                    // h_64 = h * 64
    for (int w = 0; w < (n >> 3); ++w) {    // n >> 3 = n / 8
        int w_8 = (w << 3);                 // w_8 = w * 8
        P[i][j] += A[(h_64 + i) * n + (w_8 + j)] * x[w_8 + j];  // calculate 8 sub-sums for every row /// !!!!!!!!
    }
    __syncthreads();                        // wait till all sub-sums are ready
    if (0 == j) {                           // do it only once per row
        float sum = 0;                      // init sum to 0
        for (int k = 0; k < 8; ++k) {
            sum += P[i][k];                 // sum the 8 sub-sums
        }
        y[h_64 + i] = sum;                  // save the result
    }
}
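For reference, I launch it roughly like this (host-side sketch; d_y, d_A and d_x are just my names for the device pointers, allocation and copies to the device are omitted):

dim3 block(64, 8);    // threadIdx.x = row within the 64-row tile, threadIdx.y = one of the 8 sub-sums
dim3 grid(m / 64);    // one block per 64 rows of A (m is a multiple of 64)
mv_kernel<<<grid, block>>>(d_y, d_A, d_x, m, n);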
This kernel works fine, but I want to know how I can speed this algorithm up… The code in the paper I linked above looks insane to me.
For now my only idea is to do something with the line marked "!!!". It looks like the x vector is read from global memory 64 times by each thread block (once per row)…
But I don't know what to do about it, since one thread block needs the whole x vector to compute its sums, and the x vector may be larger than the shared memory limit.
Maybe someone has an idea?
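The only thing I could come up with so far is staging x into shared memory in chunks, so each chunk is read from global memory once per block instead of once per row. Just a rough, untested sketch of the idea (mv_kernel_tiled, xs and TILE are my own names, and it additionally assumes n is a multiple of TILE):

__global__ void mv_kernel_tiled(float* y, float* A, float* x, int m, int n) {
    const int TILE = 64;                   // how many elements of x are staged at a time
    __shared__ float P[64][8];
    __shared__ float xs[TILE];             // shared copy of the current chunk of x
    int i = threadIdx.x;                   // row within this block's 64-row tile
    int j = threadIdx.y;                   // which of the 8 sub-sums this thread owns
    int h_64 = blockIdx.x << 6;            // first row handled by this block
    P[i][j] = 0;
    for (int t = 0; t < n; t += TILE) {    // walk over x in chunks of TILE elements
        if (j == 0)                        // 64 threads cooperatively load the chunk of x
            xs[i] = x[t + i];
        __syncthreads();                   // chunk of x is ready
        for (int c = j; c < TILE; c += 8)  // each thread accumulates its 8-strided columns of this chunk
            P[i][j] += A[(h_64 + i) * n + (t + c)] * xs[c];
        __syncthreads();                   // everyone is done with the chunk before it gets overwritten
    }
    if (0 == j) {                          // one thread per row adds up the 8 sub-sums
        float sum = 0;
        for (int k = 0; k < 8; ++k)
            sum += P[i][k];
        y[h_64 + i] = sum;
    }
}

No idea yet whether this actually helps - the two __syncthreads() per chunk might eat the savings.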
Hmm, strange thing - I modified my kernel so that more threads take part in the final summation, but the kernel is actually always slower now…
// matrix x vector multiplication
//
// 1 thread block for every 64 rows of matrix
// 64 * 8 threads per block
//
// y - output vector (m)
// A - input matrix (m x n)
// x - input vector (n)
// m - A.rows_num
// n - A.cols_num
__global__ void mv_kernel(float* y, float* A, float* x, int m, int n) {
    __shared__ float P[64][8];
    int i = threadIdx.x;
    int j = threadIdx.y;
    int h = blockIdx.x;
    P[i][j] = 0;
    int h_64 = (h << 6);                    // h_64 = h * 64
    for (int w = 0; w < (n >> 3); ++w) {    // n >> 3 = n / 8
        int w_8 = (w << 3);                 // w_8 = w * 8
        P[i][j] += A[(h_64 + i) * n + (w_8 + j)] * x[w_8 + j];  // calculate 8 sub-sums for every row /// !!!!!!!!
    }
    __syncthreads();                        // wait till all sub-sums are ready
    /// EDIT FROM HERE -> NEW CODE BELOW
    for (int s = 1; s < 8; s <<= 1) {       // tree reduction of the 8 sub-sums in shared memory
        int index = 2 * s * j;
        if (index < 8) {
            P[i][index] += P[i][index + s];
        }
        __syncthreads();
    }
    if (0 == j) {
        y[h_64 + i] = P[i][0];              // save the result
    }
}
Does anyone know why?
Hmm, another very strange thing: my algorithm works fine for any m and for n = 64, 128, or 256.
But for n greater than 256 (512 / 1024 / 2048) the result is incorrect, though it's almost correct:
for example, the result should be 44608184 but I get 44608256. It's strange because my matrix and vector contain only integers… so I don't think it's a problem of floating-point accuracy - so what is it?
And it happens with both versions of my kernel…
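In case it matters, this is roughly how I check the results - a simple CPU reference (just my own helper, with the sum kept in double so the CPU side itself shouldn't introduce error):

// host-side reference: y_ref = A * x for row-major A, accumulated in double
void mv_reference(const float* A, const float* x, double* y_ref, int m, int n) {
    for (int r = 0; r < m; ++r) {
        double sum = 0.0;
        for (int c = 0; c < n; ++c)
            sum += (double)A[r * n + c] * (double)x[c];
        y_ref[r] = sum;
    }
}

Then I compare y_ref[r] against the y[r] copied back from the GPU.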