Anyone done any work with optimizing gemv type operations? I have some iterative solvers where it is necessary to periodically compute a residual of the form r = Ax - b with a dense A matrix. Until now I have been using CUBLAS sgemv for this, which (because the matrix is in row-major order) requires a device-to-device memcpy and a call to the transposed sgemv() version every time the calculation needs to be done. It turns out that isn’t very fast at all - for a 1024x1024 system, CUBLAS sgemv is only hitting about 25 GB/s on a GTX275, which seems very low. Add in the memcpy and a residual calculation ends up several times slower than a solver iteration.
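For reference, the CUBLAS path looks roughly like this (a minimal sketch using the legacy CUBLAS API; the pointer names and n are placeholders, not my actual solver code, and cublasInit() is assumed to have been called already):

#include <cublas.h>         // legacy CUBLAS API
#include <cuda_runtime.h>

// r = A*x - b for a row-major n x n matrix A (lda == n assumed).
// Row-major A looks like the transpose to column-major CUBLAS, hence 'T'.
void residual_cublas(const float *d_A, const float *d_x,
                     const float *d_b, float *d_r, int n)
{
    // device-to-device copy of b into r, so the beta term can subtract it
    cudaMemcpy(d_r, d_b, n * sizeof(float), cudaMemcpyDeviceToDevice);

    // r = 1.0f * A^T(column-major) * x + (-1.0f) * r
    cublasSgemv('T', n, n, 1.0f, d_A, n, d_x, 1, -1.0f, d_r, 1);
}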
I had a quick go at trying for something faster, and came up with this:
#define TPB (192)

__constant__ unsigned int N_, LDA_;

__global__ void JORResidual(const float *A, const float *x, const float *b, float *r)
{
    __shared__ float buff[TPB];
    unsigned int rowIdx = __mul24(LDA_, blockIdx.x);

    // Compute parallel partial sums in shared memory
    buff[threadIdx.x] = 0.f;
    for (int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
        float Aval = A[rowIdx + idx];
        float xval = x[idx];
        buff[threadIdx.x] += Aval * xval;
    }
    __syncthreads();

    // Use the first warp of the block to compute a parallel reduction on the
    // partial sums in shared memory. The sub-warp steps rely on
    // warp-synchronous execution, so go through a volatile pointer to stop
    // the compiler keeping the running sums in registers.
    volatile float *vbuff = buff;
    if (threadIdx.x < 32) {
#pragma unroll
        for (int i = 32; i < TPB; i += 32) vbuff[threadIdx.x] += vbuff[threadIdx.x + i];
    }
    if (threadIdx.x < 16) { vbuff[threadIdx.x] += vbuff[threadIdx.x + 16]; }
    if (threadIdx.x <  8) { vbuff[threadIdx.x] += vbuff[threadIdx.x +  8]; }
    if (threadIdx.x <  4) { vbuff[threadIdx.x] += vbuff[threadIdx.x +  4]; }
    if (threadIdx.x <  2) { vbuff[threadIdx.x] += vbuff[threadIdx.x +  2]; }

    // Finalise and write out the result to global memory
    if (threadIdx.x == 0) {
        r[blockIdx.x] = b[blockIdx.x] - vbuff[0] - vbuff[1];
    }
}
which hits 90 GB/s on my GTX275 at 1024x1024 using one block per matrix row and 192 threads per block - a lot faster than CUBLAS sgemv at the same size. Has anyone else had a look at this? Any experiences to share?
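In case anyone wants to try it, the launch is simply one block per matrix row (a minimal host-side sketch; the device pointer names are placeholders):

// N_ and LDA_ are the __constant__ symbols declared above the kernel.
void launch_residual(const float *d_A, const float *d_x,
                     const float *d_b, float *d_r,
                     unsigned int n, unsigned int lda)
{
    cudaMemcpyToSymbol(N_,   &n,   sizeof(unsigned int));
    cudaMemcpyToSymbol(LDA_, &lda, sizeof(unsigned int));

    JORResidual<<<n, TPB>>>(d_A, d_x, d_b, d_r);   // one block per row, TPB threads each
}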
While in theory this introduces more work, in practice the SIMD unit is doing it anyway (just with some threads masked), yet additional time is spent evaluating the if conditions…
P.S. Do you happen to have some ready-to-compile test code for your global function?
In theory that will make some small improvement, but we are only talking about the last few flops out of roughly 2N^2 in total, which is probably less than 1% of the kernel run time.
I have attached something that is pretty crufty, but might work.
Played with it a bit but couldn’t find any substantially faster way to do your work.
Overall I got a questionable few-percent improvement, mostly coming from what I said before plus removing the cudaThreadSynchronize() in the innermost loop in main (you don’t need that!).
The code I took the knife to in order to hack together the example you have been playing with actually did need it (there was other host code overlapped with a different kernel there). Think of it as a vestigial structure ;)
I thought this seemed like a fun problem so I’m also slapping together my own version… I’m wondering about your results, was it initially doing 11.25 GB/s?
I currently only have a lesser card at hand, a GTS250, so I’m not sure what kind of performance to expect, BUT my first test got me 2.3 GB/s.
I guess I should be getting 5-6 GB/s to have a comparable version for this card.
With a GTX275, the profiler tells me the kernel reaches about 90 GB/s at 1024x1024 (which is about 22.5 Gflop/s, counting one A element plus one x element read per multiply-add). At 9216x9216 it is memory bandwidth limited (so about 125 GB/s), at which point it is hitting a bit over 30 Gflop/s.
Yeah the JORResidual function is the “sgemv() like” one I am interested in. The other kernel is doing a Jacobi over-relaxation iteration (which varies only slightly from the matrix-vector multiply anyway). Be aware that the problem it solves is randomly generated at run-time, so the total execution time will vary from run to run, depending on how many iterations are necessary to reach convergence (and how many convergence tests are required).
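To give an idea of what I mean by “varies only slightly”, here is a deliberately naive one-thread-per-row sketch of a JOR update - the relaxation factor omega and all the names are made up for illustration, it makes no attempt at coalescing, and it is not the attached kernel:

__global__ void jor_update_naive(const float *A, const float *x,
                                 const float *b, float *xnew,
                                 float omega, unsigned int n, unsigned int lda)
{
    unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    // Same per-row dot product as the residual kernel
    float Axi = 0.f;
    for (unsigned int j = 0; j < n; ++j)
        Axi += A[row * lda + j] * x[j];

    // Only the last line differs:
    // x_new[i] = x[i] + omega * (b[i] - (Ax)[i]) / A[i][i]
    xnew[row] = x[row] + omega * (b[row] - Axi) / A[row * lda + row];
}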
OK, so I tried a slightly different approach. I thought it would be great to be able to coalesce along the columns and let each thread handle one row of the A matrix. To achieve this I begin by transposing the data (180 usec).
Another difference is that since all the threads in a warp will be reading the same x value all the time, I thought it would be great to place these in constant memory (which you might consider to be cheating). You should check that the cudaMemcpyToSymbol isn’t becoming a bottleneck instead.
So basically:
1. Transpose the data.
2. Let each block have 32 threads working down the rows of the now-transposed matrix A_t, which means we have 32 blocks.
3. Read the x values from constant memory space.
(A rough sketch of what I mean follows below.)
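Something like this - just a sketch of the idea rather than the attached code, with N fixed at 1024 and the names made up for illustration:

#define N 1024
#define THREADS 32                 // 32 threads per block -> N/THREADS = 32 blocks

__constant__ float x_c[2048];      // x vector, filled once with cudaMemcpyToSymbol

// A_t is the transposed matrix, so consecutive threads read consecutive
// addresses (coalesced), while every thread in a warp reads the same
// x_c[k] (constant memory broadcast).
__global__ void gemv_transposed(const float *A_t, const float *b, float *r)
{
    unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;   // one row per thread
    float sum = 0.f;
    for (unsigned int k = 0; k < N; ++k)
        sum += A_t[k * N + row] * x_c[k];
    r[row] = b[row] - sum;          // written as a residual to match your kernel
}

Launched as gemv_transposed<<<N / THREADS, THREADS>>>(d_At, d_b, d_r), after copying x to the device with cudaMemcpyToSymbol(x_c, h_x, N * sizeof(float)).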
Results:

Method      usec
MY_GEMV     384.544
transpose   179.36
So the total kernel time on my GTS250 is ~564 usec according to the visual profiler. I’m guessing this should give you pretty good results on your GTX275!
A screenshot from the visual profiler and the code are attached below. Mind you, since the execution time is quite low the timers in the code currently don’t do much good; you need to use the profiler to get accurate timing.
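The transpose step is just the usual shared-memory tiled transpose, something along these lines (a sketch, not necessarily identical to the attached code; it assumes the dimensions are multiples of TILE and is launched with a TILE x TILE block and a (width/TILE, height/TILE) grid):

#define TILE 16

// in  : height rows x width cols (row major)
// out : width rows x height cols (the transpose), also row major
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];      // +1 pad avoids shared memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;    // column in 'in'
    int y = blockIdx.y * TILE + threadIdx.y;    // row in 'in'
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;        // column in 'out'
    y = blockIdx.x * TILE + threadIdx.y;        // row in 'out'
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}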
Thanks for the effort. Unfortunately the Linux visual profiler I have access to doesn’t want to profile it, so I can’t really tell you how it performs. The x vector in constant memory is probably a non-starter because it requires the problem size to be set at compile time, which is going to be impractical for a general-purpose subroutine where the size can and will vary. Of course you could set it to something large, but that eats a lot of constant memory which will often be needed for other things.
Yes, I think calling it GEMV is incorrect here. For example, on my card the x vector could be at most ~2048 elements long for floats, since that has already eaten all of my constant memory.
Ah, too bad that it doesn’t want to profile it… Check to make sure there isn’t a system(“pause”); that slipped in there…
There isn’t a pause() call. I thought it might be because of cutil (yuck), but even after I ripped out the timers it still doesn’t work. The usual trick of adding a cudaThreadExit() call didn’t help either.
OK, so I found the source of the profiling problem (your homemade initialization code is rather fragile).
EDIT: The code isn’t faster than my version. In fact it is about 5 times slower. For the 1024x1024 case, your transpose requires roughly 100 us, then the GEMV kernel takes a further 400 us. The kernel I have takes under 100 us total.
Ah, I had a slight suspicion of that… I threw all of this together quite quickly and used the CUDA VS wizard, which sets you up with an init() function which I didn’t care to replace… Sorry for any overall sloppiness in the code :)
I guess my implementation suffers quite a bit on a card with 30 SMs when there are only 32 blocks in total.