Iterative solver speed up How much speed up should I expect?

Hi all,

I have just finished my BICGSTAB solver on GPU with CSR matrix format. I have implemented my own vectorized version of SpMV (not so much optimized) along with the scalar version, switching based on the mean number of non-zeros per row. For vector dot and norm2, I use cublas.

Would somebody please let me know how much speed up should I expect in my code? I am getting almost 6x speed up on a system with an average of 88 non-zeros per row. I am using a GTX 260, Ubuntu 8.10 64 bit, CUDA 2.1, driver coming with CUDA 2.1, Core 2 Quad 9650 CPU and 4 GB of RAM. The dimension of the matrix is around 350,000. I am comparing with the same code running on CPU without taking advantage of OpenMP, etc.

Another question, how much do you think I will benefit from CUDA 2.2 zero-copy? I have not yet read the manual, nor played with it.

This is not a direct answer to your question.

But some of my earlier experiments indicated that CPUs perform 0s based arithmetic much faster. (vague memory). You may want to look @ this to understand why speedups r slow.

Memory bandwidth is the wall you are hitting…

Thanks for replies.

I am sure that my memory access pattern is not optimal in any way. But, I need to know how good or bad is my code, in the sense how far it can be optimized or on the other hand, how much useful GPUs can be in such usages. I have seen on CUDA’ site some 50x speed ups, so I really want to know if that’s possible or not. I need to use double’s which slows down the things.

I think that would be also the question for many newbies like me.

I doubt that anyone is going to be able to give you an exact (or even rough) estimate of what to expect, unless they’ve programmed the exact same thing already. You could get an idea of the best-case scenario by estimating the number of flops that your kernel uses, per thread, then multiplying by the number of threads you are running, then comparing that number to the maximum speed of the GPU you are running on.

For example, if your kernel uses 120 flops to do it’s calculations, and you’re running each iteration over 160,000 data points (a 400x400 matrix), that means a single iteration takes 19,200,000 (19.2M) flops. My 8800GT is supposed to max out around 500 GFlops/sec, so that means the kernel could do around 26,000 iterations per second on my GPU. That number does not take into account the overhead of transferring data back to the host and re-invoking the kernel, so expect the number to be somewhat lower than that.

But to re-iterate (heh), you are the only one who can make a good estimate of the speedup in your code. If you take care of the normal performance improvements (coalescing and other memory optimizations), I doubt that you’re going to get much more improvement without doing anything radical.

To exand a bit on my original (admittedly) terse answer :P
Even if you have perfectly optimal memory access patterns, the 6x speedup you are seeing is likely the best you will see.

And I have coded the exact same thing :P :
http://code.google.com/p/cudaztec/

Not exactly the same, mine was GMRES, which was hit particularly hard by bandwidth issues due to the loop of inner products on each iteration.

The spmv library that I used was pretty much optimal (written by an nvidia pro), and it gave matrix mult speedups of around the 6 that you see. Dot products give almost no speedup. I experimented with my 9800M GTS (64 cores) and found that a large dot product (1M entries) was slower on the card than a single CPU core.

However (I suggest trying the experiment) if you change the basic inner addition of the dot product to a more elaborate calculation (that doesn’t hit memory):

r_i = a_ia_i+b_ib_i + a_i*b_i + sqrt(a_i-b_i)…

you will quickly get to speedups of 140+ on the gpu. It is a number crunching monster!! But with the simple
r_i = a_i + b_i (plus reduction) you just have 64 processors sitting around saying “uhhh, anytime I can get to memory, that would be great…”

cheers

Just my quick 2 cents:

Sometimes if the calculation time is very less – the speedups could be dominated by “cudaMemcpy” (a case with black-scholes option pricing in finance). use pinned memory to accelerate cudaMemcpy and you can see good speedups.

i have implemented a preconditioned CG solver using CUDA and i get to speedups of anywhere between 3x - 10x depending on the size and sparse nature of the matrix. I use this for solving circuits. Depending on the type of circuit (planar / non planar ) i get different matrices. The main difference is in the number of non zeros per row. For planar circuits the maximum is around 40-50 with the average non zeros per row being below 10. For non planar circuits there are a few rows which have something like few thousand non zeros while the overall non zeros per row is again below 10. This gives me varied speedups. The CPU implementation is not parallelized and when I used single threaded MKL, i did not get any performance advantage over the existing implementation.

I used my own kernel for the sparse matrix vector multiplication routine. However it was marginally slower than the Nvidia hybrid kernels. I used cublas for the dot products and another kernel for the vector updates.

EDIT : I use jacobi preconditioning. How good is the cublas dot product routine ? I did not bother with checking its performance because it was a level 1 routine and assumed not much speed up can be obtained. Also more than 70-80% of the time is spent in the sparse matrix multiplication itself.

sparse matrix iterative solvers are constraint mainly by the access pattern to the vector in the smvm this is directly effected by the sparsity pattern of your matrix. One implementation i saw had a sparse pattern of exactly 8 elements on the diagonal, this guy had around x100 speedup. In my problem i get around x10 speedup after allot of hard work.

is x6 not good ?

:) good luck with the optimizations…

you mean x86 ?

i am pretty happy with the 10x speed up infact. We deal with fairly large matrices - 1 million plus rows - the only thing stopping us from going bigger has been the time taken to solve even this on the CPU - it takes about 5 minutes or more. Now i can do it in less than a minute. Plus we can go much bigger. And as you said sparse iterative solvers are difficult to optimize on any hardware.

What card are you using? Perhaps this matters??

My cuBlas is slower than my CPU, so maybe the speedup is card dependent?

i am using GTX 280. I was getting a speedup of upto 6x on a 260 itself. I did not check for cublas performance. Like i mentioned over 70% of the time is spent in SpMV kernel itself. Use the nvidia hyb kernel for the spmv. For the other vector updates i wrote a different kernel and used cublas for dot products.

Thanks everybody for sharing experiences and results! It was quite useful!