# Fastest matrix-vector multiplication?

Hi,
I’m trying for perform a matrix vector multiplication with a matrix of size approx 5500x10800 and a vector of size 10800, single precision.
The best performance that I can achieve gives about 51 gflops, and a uses about 95 GB/s memory bandwidth (I store the vector in shared memory, and the matrix in device global memory - so reading the matrix is what consumes memory bandwidth).

I’m running this on a Tesla 2070 card - and my core gives about 20% better performance than cublas sgemv.

The question is, is there something that I am missing? Should I be able to get better performance still? I’m well below the max theoretical memory bandwidth (about 150GB/s).

Any tips for how I could go faster? Are there any examples that people know about that reach close to the max memory bandwidth?

Thanks…

I think you should still be able to get better results, though I’m not perfectly sure on that (no experience). Maybe you could show us your code and we’ll see what can be improved.

The code looks something like this:
#define N 10864
#define M 5672
global void mvKernel(float *a, float *x, float y){
//Performs y = a.x for matrix a (size MxN) and vector x (size N) and vector y (size M)
int i;
float tmp=0;
float *A;
int j=0;
shared float xs[N];
//copy x into shared memory
#pragma unroll
xs[i]=x[i];

A=&a[indx];
#pragma unroll 388//note 112 is a divisor of 10864 (no of slopes):2x2x2x2x7x97
for(i=0;i<N;i++){
tmp+=xs[i]*A[j];
j+=M;
}
y[indx]=tmp;
}

And gets called with:

Note, the matrix size (Da) is actually (numBlocksNUMTHREADS) x N - I overpad it so that I don’t have to put if statements in the kernel loop - and likewise Dy is of size numBlocksNUMTHREADS instead of M - but I’m only interested in the first M elements…

But it is faster to do a bit extra computation than to include if statements.

Any suggestions welcome…

cuBLAS is column major. I hope you are also using a col-major approach.
btw, Your code can run only in FERMI as it requires more than 40K of shared memory space…Hope you have enough threads in a single block to take care of latencies.

You can try this one that avvidday and I discussed a long time ago.

It achieved around 53.4 GB/s on an OLD GTS250 which is ( 53.4 / 70 => ~ 76% bandwidth utilization) and I’m guessing it would perform much better on a newer card, old thread:

It probably needs some modifications and fixing to perform well on a Fermi card, to begin with it was not very generic :)

Yes, that is essentially what I’m doing (as the example at the top of your post) - but copying the vector to shared memory first (which improves my performance), and using a different number of threads per block - I assume you’re using 32 from the code…

I’m just looking at the code from the bottom of your post (MY_GEMV.cu) - I’ll see how that does…

Thanks,

95GB/s is quite good already given that with your matrix size occupancy must be rather low. Particularly if you haven’t turned ECC off, I don’t think you can achieve any substantial gains over that (although I know that squeezing every GFLOP/s and every GB/s out of CUDA can sometimes be addictive).

You are missing a __syncthreads() after loading the vector to shared memory. And there is no point in going through shared memory if the data isn’t reused (you could have each thread compute multiple outputs to do so, but that would further decrease occupancy which already is quite low with your matrix size). The fact that you still achieve more that half peak bandwidth shows that L2 cache is doing a good job here.

Yes, I think the 32 threads per block is quite suboptimal for Fermi and would need some modification. Also my code assumes row-major storage which adds some extra overhead as to the in-place corner turn being performed, remove this and performance should increase…

I attempted to run it on my 460m a few minutes ago and discovered it needed some modifications to run well…

Yes, EEC is turned off…

I originally had a __syncthreads(), but removed it after finding that I still get the correct results without it (though maybe this is a dangerous assumption).

I agree with you that there should be no point going through shared memory, however, I found that using it does increase performance by about 30%. I think this is because without it, I am loading the vector and matrix from global memory at the same time, which means skipping round different memory locations. However, with it, I am first loading the vector then when I load the matrix, this is all from contiguous locations…

So what we saw before was that loading the x-vector (Ax = b) into constant memory had advantages for the GT200 architecture.

That sounds interesting. Does the 95 GB/s number include the bandwidth needed for (repeated) loading of the vector as well?

One thing you could try is to break the dependency chain through manual unrolling:

``````#define N 10864

#define M 5672

__global__ void mvKernel(float *a, float *x, float *y){

//Performs y = a.x for matrix a (size MxN) and vector x (size N) and vector y (size M)

int i;

float tmp1=0, tmp2=0, tmp3=0, tmp4=0;

float *A;

int j=0;

__shared__ float xs[N];

//copy x into shared memory

#pragma unroll

xs[i]=x[i];

A=&a[indx];

#pragma unroll 28//note 112 is a divisor of 10864 (no of slopes):2x2x2x2x7x97

for(i=0;i<N-3;){

tmp1+=xs[i++]*A[j];

j+=M;

tmp2+=xs[i++]*A[j];

j+=M;

tmp3+=xs[i++]*A[j];

j+=M;

tmp4+=xs[i++]*A[j];

j+=M;

}

for(i=0;i<N;){

tmp1+=xs[i++]*A[j];

j+=M;

}

y[indx]=tmp1+tmp2+tmp3+tmp4;

}
``````

That should give you a lot more outstanding memory transactions. Whether the memory controller is able to take advantage is anyone’s guess though. Would however be interesting to see if the effect of preloading to shared memory is still as large then.

Alternatively, the manual unrolling could be used to generate multiple outputs per thread. That would introduce some explicit reuse.

No - actually, the way I calculated this is that it takes about 2.4ms to do the kernel, so 1086456724/0.0024 / (102410241024) gives 95.6GB/s.

I suppose I should also include the time to load the vector into shared memory - which is loaded by 13 blocks, so 13 times, and also take into account the matrix padding, increasing the matrix size to 10864 x 5824. Taking all this into account gives 98GB/s.

I presume you meant to say:

``````#pragma unroll

for(i=0;i<N-4;){

tmp1+=xs[i++]*A[j];

j+=M;

tmp2+=xs[i++]*A[j];

j+=M;

tmp3+=xs[i++]*A[j];

j+=M;

tmp4+=xs[i++]*A[j];

j+=M;

}

y[indx]=tmp1+tmp2+tmp3+tmp4;

}
``````

(-4 not -3, and removed the final for loop at the end)

When I do this, I get about 91GB/s, so worse than the 98GB/s. However, I’m not entirely sure what this is trying to achieve - it is also unable to unroll the loop because it sees the loop counter is incremented in the loop.

Not entirely sure what you mean? If you mean each thread multiplying more than 1 row of the matrix - yes, I’ve tried that, and it wasn’t as good.

I’m also not sure why using 448 threads per block gives best performance… obviously there are 448 processing cores, but surely that is irrelevant?

Hmm - actually 416 threads per block gives best performance - now up to 100GB/s. But I’m sure there must be a way of getting closer to the theoretical max of 144GB/s… at the moment its 70% of theoretical max…

No, the code was meant as it is. The upper limit of N-3 and the extra for loop were meant to handle the case where N is not a multiple of 4.

Ok, apparently I overestimated the loop unroller’s abilities then. Fixing that, how fast is

[codebox]

#define N 10864

#define M 5672

global void mvKernel(float *a, float *x, float *y){

``````//Performs y = a.x for matrix a (size MxN) and vector x (size N) and vector y (size M)

int i;

float tmp1=0, tmp2=0, tmp3=0, tmp4=0;

float *A;

int j=0;

__shared__ float xs[N];

//copy x into shared memory
``````

#pragma unroll

``````for(i=threadIdx.x;i<N;i+=blockDim.x)

xs[i]=x[i];

``````

A=&a[indx];

#pragma unroll 97 //note 4x97 is a divisor of 10864 (no of slopes):2x2x2x2x7x97

``````for(i=0;i<N-3;i+=4){

tmp1+=xs[i]*A[j];

j+=M;

tmp2+=xs[i+1]*A[j];

j+=M;

tmp3+=xs[i+2]*A[j];

j+=M;

tmp4+=xs[i+3]*A[j];

j+=M;

}

for(i=0;i<N;i++){

tmp1+=xs[i]*A[j];

j+=M;

}

y[indx]=tmp1+tmp2+tmp3+tmp4;
``````

}

[/codebox]?

The idea was that by breaking the dependency chain, the next FMAs can already start executing without having to wait for the preceding FMAs to finish, which should allow more global memory reads to be in flight. Seems I was mistaken however as the compiler will (mostly) move the load instructions before the FMAs anyway.

Yes, that’s what I meant.

So apparently there really is some benefit to read linearly through memory. I vaguely remember another thread which came to the same conclusion.

OK - still not quite as fast - 95GB/s compared with 100GB/s.

Note - the extra loop isn’t needed in this case because n%4==0.

Yes, there is significant benefit to reading linearly through memory - because then coalescing can occur. Which is why in cuda MVM should be done with row major matrices, while in C best done with column major.

Yes, of course. But even coalesced memory accesses seem to get sped up if multiple accesses go to consecutive memory addresses. One might think that there is so much interleaving and mixing because multiple blocks submit their requests in random relative order that the ordering within a block makes no difference. Apparently the memory controllers are sufficiently reordering the requests though to take advantage of multiple accesses to the same row.

I modified the previous code for column major. This can perform the operations in 1456 us on my GTX460m on a 4096 by 4096 matrix.

This would yield ((4096^2)4 + 409642) / (145610^-6 s) = 46.1 GB/s with peak of the card being 60 GB/s =>

46.1/60 = 76.8 % of peak.

I think you will find that the code is very simple:

``````const int N = 4096;

const int threads = 64;

__constant__ float d_x[N];

__global__ static void MY_GEMV_3(float* A, float* result)

{

unsigned int t_x = threadIdx.x + blockIdx.x*(threads);//blockIdx.y*32*(N+64);

float sum = 0.0f;

const int w = 512;

#pragma unroll

for(int i = 0; i < (N)/w; i++)

{

#pragma unroll

for(int j = 0; j < w; j++)

{

sum += A[t_x + (j+i*w)*N] * d_x[j + i*w];

}

}

result[t_x] = sum;

}
``````

Profiler output:

[attachment=21037:Visual_profiler_GEMV.png]

Yes, that is effectively what I am doing - except I let the compiler unroll loops, rather than unrolling some myself (i.e. your 2 nested loops)… my memory ordering is the same as yours - i.e. incremental memory in x requires a step of N in A.

So - for anyone reading this in the future, my conclusion is that a 70% of theoretical memory bandwidth is about the best you will get for a Tesla 2050/2070 performing a matrix vector multiplication. This is somewhat dependent on thread scheduling (number of threads per block), and also on matrix size.

Here is the code that performs this:

``````#define MAXN 12000 //this is the max size of x (to fit in 48kB shared memory).

__global__ void sgemvKernel(int M,int N,const float* __restrict__ a,const float* __restrict__ x,float* __restrict__ y){

//y=A.x

int i;

float tmp=0;

const float *A;

int j=0;

__shared__ float xs[MAXN];

//copy x into shared memory

#pragma unroll

xs[i]=x[i];

__syncthreads();//not sure this is needed (results are correct without it) - but I think in theory it is...

A=&a[indx];

#pragma unroll 512

for(i=0;i<N;i++){

tmp+=xs[i]*A[j];

j+=M;

}

y[indx]=tmp;

}
``````

Call with