Fast gemv()

Anyone done any work on optimising gemv-type operations? I have some iterative solvers where it is necessary to periodically compute a residual of the form r = Ax - b with a dense A matrix. Until now I have been using CUBLAS sgemv for this, which (because the matrix is in row-major order) requires a device-to-device memcpy and a call to the transposed sgemv() version every time the calculation needs to be done. It turns out that isn't very fast at all - for a 1024x1024 system, CUBLAS sgemv is only hitting about 25Gb/s on a GTX275, which seems very low. Add in the memcpy and it turns out that a residual calculation is several times slower than a solver iteration.
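For reference, the operation in question is just a dense row-major matrix-vector product followed by a subtraction. A plain CPU version (only a sketch for checking a GPU kernel against - the name residual_ref is mine, not from any library) looks like:

```c
#include <stddef.h>

/* Reference residual r = A*x - b for a row-major N x N matrix A.
   Purely a CPU correctness check, not an optimised routine. */
void residual_ref(const float *A, const float *x, const float *b,
                  float *r, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.f;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];   /* dot product of row i with x */
        r[i] = acc - b[i];
    }
}
```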

I had a quick go at trying for something faster, and came up with this:

#define TPB (192)

__constant__ unsigned int N_, LDA_;

__global__ void JORResidual(const float *A, const float *x, const float *b, float *r)
{
	__shared__ float buff[TPB];

	unsigned int rowIdx = __mul24(LDA_, blockIdx.x);

	// Compute parallel partial sums in shared memory
	buff[threadIdx.x] = 0.f;
	for(int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
		float Aval = A[rowIdx+idx];
		float xval = x[idx];
		buff[threadIdx.x] += Aval * xval;
	}
	__syncthreads();

	// Use first warp of block to compute parallel reduction on the
	// partial sums in shared memory.
	if (threadIdx.x < 32) {
		#pragma unroll
		for(int i=32; i<TPB; i+=32) buff[threadIdx.x] += buff[threadIdx.x+i];
	}
	if (threadIdx.x < 16) { buff[threadIdx.x] += buff[threadIdx.x+16]; }
	if (threadIdx.x < 8)  { buff[threadIdx.x] += buff[threadIdx.x+8]; }
	if (threadIdx.x < 4)  { buff[threadIdx.x] += buff[threadIdx.x+4]; }
	if (threadIdx.x < 2)  { buff[threadIdx.x] += buff[threadIdx.x+2]; }

	// Finalise and write out the result to global memory
	if (threadIdx.x == 0) {
		r[blockIdx.x] = b[blockIdx.x] - buff[0] - buff[1];
	}
}

which hits 90Gb/s on my GTX275 at 1024x1024 with a block per matrix row and 192 threads per block, which is a lot faster than CUBLAS sgemv at the same size. Has anyone else had a look at this? Any experiences to share?
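For anyone reading along, the reduction in that kernel can be modelled serially to check the indexing: the TPB partial sums are first folded into 32 slots by the first warp, then halved down to two slots, and the final value is b minus the two survivors. A CPU sketch of that logic (the function name is mine; this is not GPU code):

```c
#define TPB 192

/* Serial model of the kernel's two-stage shared-memory reduction:
   fold TPB partials into 32 slots, tree-reduce to slots 0 and 1,
   then return b - buff[0] - buff[1] as the kernel does. */
float reduce_model(float buff[TPB], float b)
{
    /* Stage 1: each of 32 "lanes" accumulates every 32nd partial. */
    for (int tid = 0; tid < 32; ++tid)
        for (int i = 32; i < TPB; i += 32)
            buff[tid] += buff[tid + i];

    /* Stage 2: halve 32 -> 16 -> 8 -> 4 -> 2 slots. */
    for (int s = 16; s >= 2; s /= 2)
        for (int tid = 0; tid < s; ++tid)
            buff[tid] += buff[tid + s];

    return b - buff[0] - buff[1];
}
```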

I am not much into linear algebra. However the latter part of your code could be replaced by:

if (threadIdx.x < 32) {

	#pragma unroll

	for(int i=32; i<TPB; i+=32) buff[threadIdx.x] += buff[threadIdx.x+i];

	buff[threadIdx.x] += buff[threadIdx.x+16];

	buff[threadIdx.x] += buff[threadIdx.x+8];

	buff[threadIdx.x] += buff[threadIdx.x+4];

	buff[threadIdx.x] += buff[threadIdx.x+2];

}

While in theory this introduces more work, in practice the SIMD unit is doing it anyway (just with some threads masked). Yet additional time is spent evaluating the if conditions…
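The reason dropping the guards is safe can be checked with a serial model, under the assumption that all lanes of a warp read before any lane writes (the lock-step SIMD behaviour being relied on above): lanes 0 and 1 only ever consume values from lanes below the guard threshold, so the garbage accumulated in the upper lanes never reaches them. A CPU sketch of the unguarded reduction with read-before-write semantics (names mine):

```c
#include <string.h>

/* Model of the branchless warp reduction: all 32 lanes execute every
   add, with SIMD semantics emulated by double-buffering (every lane
   reads the old values before any lane's write lands).  The buffer
   needs 32+16 entries so lane 31 has something to read at offset 16. */
float warp_reduce_branchless(const float in[48])
{
    float cur[48], nxt[48];
    memcpy(cur, in, sizeof cur);
    for (int s = 16; s >= 2; s /= 2) {
        memcpy(nxt, cur, sizeof nxt);
        for (int t = 0; t < 32; ++t)        /* all 32 lanes, no guard */
            nxt[t] = cur[t] + cur[t + s];
        memcpy(cur, nxt, sizeof cur);
    }
    return cur[0] + cur[1];   /* correct sum of the first 32 inputs */
}
```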

P.S. Do you happen to have some ready-to-compile test code for your global function?

In theory that will make some small improvement, but we are only talking about the last few flops out of about 2N^2, which is probably less than 1% of the kernel run time.

I have attached something that is pretty crufty, but might work.

EDIT: A bit too crufty, as it turned out. Fixed.

Played with it a bit but couldn’t find any substantially faster solution for your problem.
Overall I got some questionable few-percent improvement, mostly coming from what I said before plus removal of the cudaThreadSynchronize() in the innermost loop in main (you don’t need that!).

Thanks for trying.

The code I took the knife to in order to hack together that example you have been playing with actually did need it (there was other host code overlapped with a different kernel there). Think of it as a vestigial structure ;)

I thought this seemed like a fun problem so I’m also slapping together my own version… I’m wondering about your results - was it initially doing 11.25 GB/s?

I currently only have a lesser card at hand, a GTS250, so I’m not sure what kind of performance to expect, BUT my first test got me 2.3 GB/s.

I guess I should be getting 5-6 GB/s to have a comparable version for this card.

With a GTX275, the profiler tells me the kernel reaches about 90Gb/s at 1024x1024 (which is about 22.5 Gflop/s). At 9216x9216, it is memory bandwidth limited (so about 125Gb/s), at which point it is hitting a bit over 30 Gflop/s.
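Assuming those figures count both the matrix element and the re-read x element per multiply-add (8 bytes of traffic per 2 flops), the quoted numbers are self-consistent: the flop rate comes out as a quarter of the effective bandwidth. A back-of-envelope sketch of that arithmetic (the function name is mine):

```c
/* Roofline-style estimate for an N x N single-precision gemv,
   counting 8 bytes moved (A element + x element) per 2 flops,
   so achievable Gflop/s = bandwidth / 4. */
double gemv_gflops(double n, double gbytes_per_s)
{
    double bytes = 8.0 * n * n;                 /* A plus re-read x  */
    double flops = 2.0 * n * n;                 /* one FMA per entry */
    double t = bytes / (gbytes_per_s * 1e9);    /* seconds           */
    return flops / t / 1e9;
}
```

With these assumptions, 90 effective units of bandwidth give 22.5 Gflop/s, matching the figure above.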

Aha! OK !

Now i understand your results! :)

Will post some numbers later then…

Sleep well!

I didn’t look at your code in much detail, but it’s the JORResidual time you’re comparing against, right? Total time 466 us?

Will confirm in the morning, but my profiler tells me it’s doing it in approx 560 us…

Yeah, the JORResidual function is the “sgemv()-like” one I am interested in. The other kernel is doing a Jacobi iteration (which varies only slightly from the matrix-vector multiply anyway). Be aware that the problem it solves is randomly generated at run-time, so the total execution time will vary from run to run, depending on how many iterations are necessary to reach convergence (and how many convergence tests are required).

Ok, so I tried a slightly different approach. I thought it would be great to be able to coalesce along the columns and let each thread handle one row of the A matrix. To achieve this I begin by transposing the data (180 usec).

Another difference is that since all the threads in a warp will be reading the same x value all the time, I thought it would be great to place these in constant memory (which you might consider to be cheating). You should check to make sure that the cudaMemcpyToSymbol isn’t becoming a bottleneck instead.

So basically:

1. Transpose the data.

2. Let each block have 32 threads working down the rows of the now-transposed matrix A_t, which means we have 32 blocks.

3. Read the x values from constant memory.
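A serial model of that access pattern (sizes and names illustrative, CPU only): after the transpose, the inner loop over i touches consecutive elements of A_t, which is what makes the loads coalesce when i maps to the threads of a warp:

```c
#include <stddef.h>

/* y = A*x computed via the row-major transpose At (At[k][i] = A[i][k]).
   The inner loop over i walks consecutive addresses of At - the access
   pattern that coalesces when i is the thread index within a warp. */
void gemv_transposed(const float *At, const float *x, float *y,
                     size_t n)
{
    for (size_t i = 0; i < n; ++i) y[i] = 0.f;
    for (size_t k = 0; k < n; ++k)           /* step down the rows of At */
        for (size_t i = 0; i < n; ++i)       /* "threads" within a warp  */
            y[i] += At[k * n + i] * x[k];    /* consecutive i: coalesced */
}
```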


Method      usec
MY_GEMV     384.544
transpose   179.36

So the total kernel time on my GTS250 is ~564 usec according to the visual profiler. So I’m guessing this should give you pretty good results on your GTX275!

A screenshot from the visual profiler and the code are attached below. Mind you, since the execution time is quite low, the timers in the code currently don’t do much good; you need to use the profiler to get accurate timing.



Thanks for the effort. Unfortunately the linux visual profiler I have access to doesn’t want to profile it, so I can’t really tell you about the performance of it. The x vector in constant memory is probably a non-starter because it requires the problem size to be set at compile time, which is going to be impractical for a general purpose subroutine where the size can and will vary. Of course you could set it to something large, but that eats a lot of constant memory which will often be needed for other things.

Yes, I think calling it GEMV is incorrect here. For example, on my card the x vector could be at most ~2048 elements long for floats, since that has already eaten all of my constant memory.

Ah, too bad that it doesn’t want to profile it… Check to make sure there isn’t a system(“pause”); that slipped in there…

There isn’t a pause() call. I thought it might be because of cutil (yuck), but even after I ripped out the timers it still doesn’t work. The usual trick of adding a cudaThreadExit() call didn’t help either.

Ah, annoying, thanks for trying though! Maybe someone running a windows platform will be able to profile it on a slightly fresher card than mine!

OK, so I found the source of the profiling problem (your homemade initialization code is rather fragile).

EDIT: The code isn’t faster than my version. In fact it is about 5 times slower. For the 1024x1024 case, your transpose requires roughly 100 us, then the GEMV kernel takes a further 400 us. The kernel I have takes under 100 us total.

Ah, I had a slight suspicion of that… I threw all of this together quite quickly and used the CUDA VS wizard, which sets you up with an init() function which I didn’t care to replace… Sorry for any overall sloppiness in the code :)

I guess my implementation suffers quite a bit on a card with 24 SMs when you only have 32 blocks in total.

Yeah, I was just looking at your timing results… Not the 466 usecs I was looking at yesterday!?

I’m surprised though that the overall timing didn’t go down much, from 564 us on my card to ~500 on yours, but this may be due to the very low number of blocks.

Look again. 466 usecs for 5 calls…

Ah, darn :)