Oh, I found a bug in the code… Change the kernel to this and the results should improve by at least 3-4x:
__global__ static void MY_GEMV(float* A, float* result)
{
    // One output element per thread; 32 threads per block is assumed
    // (hence the hard-coded blockIdx.x*32). N is a compile-time constant
    // and d_x is the device-side copy of the vector x.
    unsigned int t_x = threadIdx.x + blockIdx.x*32;

    float sum = 0.0f;
#pragma unroll
    for (int row = 0; row < N; row++)
    {
        // A is laid out so that consecutive threads read consecutive
        // addresses here, i.e. the loads are coalesced.
        sum += A[t_x + row*N]*d_x[row];
    }
    result[t_x] = sum;
}
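For reference, a launch configuration matching the indexing above would look something like this (host-side sketch only; it assumes N is a compile-time constant that is a multiple of 32, that d_x already holds x on the device, and host_A is a hypothetical host array in the layout the kernel expects):

float *dev_A, *dev_result;
cudaMalloc((void**)&dev_A, N*N*sizeof(float));
cudaMalloc((void**)&dev_result, N*sizeof(float));
cudaMemcpy(dev_A, host_A, N*N*sizeof(float), cudaMemcpyHostToDevice);

dim3 block(32);     // the kernel hard-codes 32 threads per block (blockIdx.x*32)
dim3 grid(N/32);    // one thread per output element
MY_GEMV<<<grid, block>>>(dev_A, dev_result);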
Seems I was unnecessarily reading from and writing to global memory… But including the transpose, I guess your version will still be twice as fast - but what about larger matrices? I think that is when the payback for transposing might come…
It certainly is faster with the unnecessary global loads and stores removed from the sgemv kernel, but it is still twice as slow as what I started with once the transpose is added in. At larger sizes the performance of the gemv kernel is very good, but the overall rate is no better, because the transpose becomes very costly (more expensive by itself than my sgemv kernel). These are at 9216x9216:
My compiler started complaining as soon as I passed 8 kB on this card, but those numbers seem familiar. Will check if there is any difference for 1.1 devices.
Incidentally, it also just occurred to me that your code is calculating r = Ax rather than r = Ax - b like mine is, so there are some additional memory loads that would have to be included somehow as well.
Yes, I've been well aware of this; my only interest was in the matrix-vector multiplication… Actually I thought maybe you had modified your kernel by removing this code for a better comparison… Maybe 'b' could be loaded in the same fashion as 'x'.
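Something along these lines is what I have in mind (just a sketch; d_b is hypothetical, declared and filled the same way as d_x):

__global__ static void MY_GEMV_RESIDUAL(float* A, float* result)
{
    unsigned int t_x = threadIdx.x + blockIdx.x*32;
    float sum = 0.0f;
#pragma unroll
    for (int row = 0; row < N; row++)
    {
        sum += A[t_x + row*N]*d_x[row];
    }
    // d_b is a made-up name here: if b lives in the same kind of storage as d_x,
    // the residual only costs one extra read per output element.
    result[t_x] = sum - d_b[t_x];
}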
It doesn’t matter though, you already have the fastest implementation!
Shared memory reads/writes aren't as fast as using registers (~20-30 ns). But if you're running the smaller 1024x1024 case, this probably won't be visible.
Thanks for looking further - there are a couple of little "1%" things that could be tweaked to wring the last few FLOPs out of it. The results of the original code I came up with really surprised me compared to CUBLAS, which I have always thought was reasonable. The code in the first post is over twice as fast as the transpose CUBLAS sgemv at every size I have tested it at (all in single precision GFLOP/s):
I think I’ll stick to questions and hints about your jacobi.cu =)
You say that you need to do a dense matrix-vector product, but what you are actually working with is somewhat sparse. If you do not plan to change the matrix, you might very well gain a lot from taking advantage of that (this relates to JORkernel as well).
You will definitely speed up your algorithm by transposing the matrix throughout all of your kernels. The thing is that every element of x is being accessed as many times as you have rows! By transposing, you will be able to load a piece of the vector into shared memory and then process the corresponding part of the matrix, as in the sketch below. This will reduce reads from the vector by a factor of 192… And as far as I can see you have the same 'problem' in JORkernel as well. Excess global memory access is bad!
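To make the idea concrete, here is roughly the access pattern I mean (my own sketch, not tuned code - it assumes the matrix has already been transposed and stored so that consecutive output rows are adjacent in memory, and that N is a multiple of the block size):

#define BLOCK_SIZE 192   // matches the factor-of-192 reduction in vector reads above

__global__ void gemv_tiled(const float* At, const float* x, float* result, int N)
{
    __shared__ float x_tile[BLOCK_SIZE];

    unsigned int row = threadIdx.x + blockIdx.x*BLOCK_SIZE;  // one output element per thread
    float sum = 0.0f;

    for (int tile = 0; tile < N; tile += BLOCK_SIZE)
    {
        // Each element of x is now read from global memory once per block,
        // i.e. N/BLOCK_SIZE times in total instead of once per output row.
        x_tile[threadIdx.x] = x[tile + threadIdx.x];
        __syncthreads();

        for (int j = 0; j < BLOCK_SIZE; j++)
            sum += At[row + (tile + j)*N] * x_tile[j];   // coalesced across the block

        __syncthreads();   // don't overwrite x_tile while other threads still use it
    }
    result[row] = sum;
}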
Related to your JORkernel, I would highly recommend removing the branch inside the loop. That is really not good. Try to see if you can find a way to handle it outside the loop instead, for example along the lines of the sketch below.
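I don't have JORkernel in front of me, so this is only a guess at its shape (all names hypothetical), but the classic Jacobi trick is to sum over everything and correct for the diagonal once, rather than testing inside the loop:

// Hypothetical illustration - branch inside the loop, tested every iteration:
__device__ float jor_sum_branch(const float* A, const float* x, int i, int N)
{
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        if (j != i)
            sum += A[i + j*N]*x[j];
    return sum;
}

// Same result with no branch: sum everything, then subtract the diagonal term.
__device__ float jor_sum_nobranch(const float* A, const float* x, int i, int N)
{
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        sum += A[i + j*N]*x[j];
    return sum - A[i + i*N]*x[i];
}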
And increase the number of JORkernel executions between memcpys by a factor of 10 - memcpy is ridiculously expensive!
Dense algebra performs considerably better in this application - the sparse solver goes to great lengths to reorder into dense blocks that can use dense BLAS for precisely this reason.
I don't believe that is even close to true. The problem is that the transpose itself is very expensive - you replace the N^2 redundant vector reads with N^2 matrix reads (plus N^2 writes) just to perform the transpose, which is no more efficient. Jimmy's code showed that graphically. Using the accepted optimal transpose code, his version winds up being two to three times slower at large sizes than the non-transposed version.
Branching isn't necessarily bad - divergent branching is bad. For a vector length of 9216, only a single warp will diverge, exactly once, out of the 288 warp executions per block required to retire the loop. The rest don't, and the overall performance hit is very small: there is less than 1% difference in overall performance between JORkernel and JORresidual, one of which contains a branch in the summation loop and one of which does not.
Actually it isn't. The device-to-host memcpy of a 9216-element single precision vector into pageable memory takes 45 us (9216 x 4 bytes in 45 us is a bit under 1 GB/s), compared with a kernel run time of about 5500 us. So the copy only adds about 0.8% to each kernel run. You can see this in the attached profiler output.
I will grab your column major order kernel from the other thread and benchmark it at some point to see how it performs.
Just as a note, I modified my kernel last week to see if it would be faster to do the transpose “in place” in my MY_GEMV kernel since that would cut the load/stores from global memory in half.
The performance was of course better but still around 19% slower than your version…
Yeah, I was thinking about an in-place transformation as well, but I still couldn't come up with a scheme that looked like it would be better, especially at small problem sizes, where it becomes hard to get enough warps per multiprocessor and enough total blocks to keep throughput up - especially on a GT200 with 30 multiprocessors.
As for partition camping, I think it should only be an issue in that execution configuration on cards with 8 memory partitions (i.e. GTX 280/285 or C1060/S1070). What card did you run that on?
Yeah, so I did do a fix for the partition camping, which gave a ~22% performance increase. So it seems my version is actually faster now, at least on my GTS 250.
[attachment=16561:my_mv.png]
Will have to check how this scales for larger problems as well… Good thing I checked for the partition camping though!
Sure! My suspicion is that a solution with only 32 blocks isn't going to scale well on a card with 30 SMs though, since it will be harder to hide the off-chip latencies.
It might need a rewrite/modification to work better on such a card. Maybe I will write something for that later today; anyway, have a go at it:
[attachment=16566:MY_GEMV.cu]
Oh, by the way, check whether the "padding" parameter works for your memory interface… It is a bit crude, but it was a quick and easy way to deal with the partition camping :)
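In case the padding idea isn't obvious, this is roughly what it does (my own sketch with made-up names - the attached file may implement it differently): the matrix gets a padded leading dimension so that successive columns don't all start in the same memory partition.

#define PAD 32                 // made-up value - tune for your card's memory interface
const int ld = N + PAD;        // padded leading dimension instead of N

float* d_A;
cudaMalloc((void**)&d_A, (size_t)ld*N*sizeof(float));   // allocate ld x N instead of N x N

// Element (row, col) now lives at d_A[row + col*ld], so inside the kernel the
// indexing becomes A[t_x + row*ld] rather than A[t_x + row*N]. The extra PAD floats
// per column are never touched; they just stagger the column start addresses
// across the memory partitions.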
Here is a new version; it's about 11 us slower on my card, but it might actually scale better on yours and lead to an overall improvement. There are now 64 blocks for N = 1024…