conjugate gradient

hi,

what do you think would be right direction in implementing cg solver with cuda?

Basically:
External Media

would it be enough to upload A and C (preconditioner) to GPU once at start, on every iteration upload vectors, do the multiplication and download the result, letting CPU do the rest of work - this would treat GPU as nothing more as matrix-vector multiplication coprocessor.

Or maybe I should move, in some way, the whole algorith to GPU?

First solutions sounds easy to implement, but I don’t know if multiplication alone would amortize the CPU-GPU-CPU transfers. Second solutions sounds harder and I don’t have a clear idea how to do it right now.

Please share your thoughts - what would be the right strategy to do it? thx

(I can’t estimate things myself right now since I only have access to g80 at university - so I can only do theoretical investigation)

Performing only the matrix-vector-multiplication probably won’t help speed-up things. The “rapid prototyping” approach is probably to write a CUDA kernel for every “operation”:

  • one for dot products
  • one for matvec
  • one or two for the vector updates

You might also want to be careful about precision, I did a CG for Finite Element PDE systems, and I definitely needed a double precision update loop on the CPU every once in a while (say, every 500 iterations or so)

Im also trying to implement CG for equations resulting from FEM elastostatics problem. Are you planning to make your implementation available for wider audition? Or maybe share some more insights with us - for. ex are you performing main loop on cpu, and only dispatch the “operations”: matvec, vecvec to gpu - does such strategy perfroms better than just the cpu, for ex. for mat size about 10000? thx a lot

From my experience, with a size of only 10000 you will have to perform almost everything on the GPU, and not a separate kernel for each function. In my case, e.g. a scalar product on the GPU takes about 12 us, the matvec about 60 us (the matrix has about 7 entries per matrix row in an almost, but not completely, band structure), which means that the lauch time of 20 us for the kernels is very significant unless you manage to combine several operations in one kernel (note that this is even with double precision emulated via float2).

Oh, and in case you know of a preconditioning method that works really well on the GPU and is not too complex to implement, I am still searching for something like that ;-)

I am only using Conjugate Gradients as coarse grid solver for multigrid, so no, I probably won’t make my implementation available in the near future. It’s too special.

Anyway, I run the main loop on the CPU. My matrices are large enough, so that any launch overhead (or transferring results of dot products for convergence checks etc) just are not dominant. The code is a mixed precision iterative refinement scheme, you can find some pseudocode in my paper on the topic (it’s a bit old regarding performance results, but otherwise valid):
[url=“http://www.mathematik.tu-dortmund.de/~goeddeke/pubs/ijpeds06.html”]http://www.mathematik.tu-dortmund.de/~goed...s/ijpeds06.html[/url]

As for preconditioners: SPAI (sparse approximate inverse) might be reasonably useful. In some preprocessing stage, you pass your matrix to some library the SPAI guys provide, and they automagically compute an approximate inverse, and this is the trick, with a nonzero pattern you specify. This is costly, but amortises pretty well if you need enough preconditioning steps, which become a simple matrix-vector multiplication. Reverse-engineering the SPAI code in CUDA directly on the GPU is, well, not recommended…

If you have some results. It would be very nice to see them.
We tried some CG solver here in the lab, but our conclusion for sparse systems is that a multi-core system is faster than a CUDA implementation.

If you only did a quick try, a speedup is unlikely IMO.

The results I have at hand are for the OpenFOAM “simpleFoam” tutorial (more precisely the “pitzDaily3Blocks” part), which is probably not a truly realistic example, and in addition biased due to the slow CPU in the system (see my profile) and it using only a single CPU/GPU.

This uses the PCG method with DIC preconditioning.

Time of the original implementation:

3:45 min, ca. 170 iterations per time step

First naive implementation using GPU only for matvec etc., preconditioning on CPU:

7:33 min

Current implementation, using only diagonal normalization instead of true preconditioning, and merging as many function in one single GPU kernel as possible:

2:56 min, ca. 600 interations per time step

The biggest annoyance to me at this point is, that executing GPU kernels pegs the CPU with 100% load.

I did do some experiments with SPAI, but for this problem the full algorithm is too slow, and a reduced one I tried did not work well at all. Unfortunately my numerical mathematics courses have been a long time ago, but I think the problem might be that the parallel SPAI algorithm results in non-symmetric matrices which PCG can not handle.

Currently I am experimenting with (non-adaptive) polynomial preconditioning which looks promising if I can optimize the speed, but obviously it’s not exactly a beautiful solution.

Sure I know. But as usual it depends on the constraints you have. Our setting are real time fluid and FEM simulations. Thus, our systems are small and we don’t need many iterations. So for systems of equations almost fitting in the CPU cache you will have problems in beating the CPU with CUDA.

Edit: Sorry, that should have gone below the comment of Reimar.

Check the simplestreams example. The CPU does only busy-wait on things like cudamemcpy and when you have called 16 (or I believe 24 on 1.1 hardware) kernels in a row and the first one did not finish yet.

Well, maybe I am just doing something wrong, but what I see so far is that neither cudaEventRecord + cudaEventSynchronize nor a cudaStreamQuery + usleep loop help anything, whether I do them every 9, 18 or several ca. 160 kernel calls.

If I had to guess I’d say this is probably because those 18 kernel calls all together just take about 1ms to complete, including setup time.

Oh, and I have now another data point:

Using polynomial preconditioning with degree 4 (3 additional matvec per iteration):

2:29 min, ca. 150 interations per time step

Note that the OpenFOAM example unfortunately spends only about 50% of time in PCG, so it might be time to see how to optimize some other parts.

While I still believe that PCG is not the kind of algorithm that is optimal from a numerical point of view (unless you precondition it with patch-wise multigrid, and end up with the usual drawbacks of Schwartz-CG approaches), I am still curious as to how an “optimal” CG implementation for a sparse system looks like in CUDA.

The best we came up with is our so-called (residual guided) pipelined CG which we originally designed for FPGAs:
[url=“http://www.mathematik.tu-dortmund.de/~goeddeke/pubs/fccm06.html”]http://www.mathematik.tu-dortmund.de/~goed...ubs/fccm06.html[/url]
The “pipelined” CG is essentially three kernels, one for MatVec, one for vector updates, and one for all reductions. If the matrix is banded, two kernels suffice. I don’t believe you can squeeze an entire CG in less than this number of kernels, but I’m happy to be proven wrong. The residual guided variant is tailored to mixed precision schemes btw.

Please let me know what you think.

btw, if SPAI breaks symmetry, why don’t you just run BiCG instead of CG? Preconditioned BiCGStab is just the same effort to implement in CUDA as CG is…

Hi,

I implemented a CG solver using CUDA and, compared to the CPU version, gets an average 5x speedup (10x in some veery fortunate cases), from start to double precision accuracy. I am using either iterative refinement, like Dominik said before, where I do correction loops on the CPU, or emulated double precision. For a standard CG implementation, the restarts become a problem, so the emulated double works much better in the majority of the cases. Also, all BLAS1 and BLAS2 operations in the CG algorithm are so memory bounded that you can get away with doing emulated double for a very small performance penalty.

Now of course, I don’t have a decent preconditioner … only diagonal scaling … so there is work to do there. If you use SPAI, you should store the inverse on the GPU as well, right? And trying SPAI in PETSc took forever, compared to IC.

For who’s interested, I intend to make the code public when it reaches a more stable form. It is usable, but the configuration script sucks and probably has lots of undiscovered bugs. And of course, is linux only for the moment.

Interesting. I guess you have somewhat larger test matrices?

Hm, at least with emulated double, the matrix-vector multiplications are very strongly compute-bound for me. This is after I implemented some kind of manual caching using the shared memory though (a band-like matrix structure is very helpful here).

Btw. how do you determine if it is compute or memory bound? To me varying GPU and GPU memory clock seems to be most reliable.

Right, the smallest I use is around 10,000 rows, but the majority have 100,000 or more.

To see if it’s memory bound or not, I run the single precision and the emulated double precision versions for the same kernel and the same data, and compared the performance. If the bandwidth I get for both it’s the same, then it’s memory bound. Or another way to look at this would be to see if the performance for single precision would be 1.6x the one for emulated doubke, since this is as good as it can be.

For dot product, for large vector sizes, I get a little compute bound (around 30% difference between the two) while for AXPY and SCAL I get the same bandwidth.

For SPMV, for most of the matrices, I am still compute bound, with less than 20% difference in bandwidth for most of the matrices. There are some matrices though for which there is 40-50% difference, but I don’t consider this to be “very strongly compute bound”. After all, there are 10-30x times more instructions. For this reason, I would expect the 70GFlop/s or so of double precision for the 280GTX to be more than enough for CG.

Btw, are you working in banded format or CSR ?

I have recently (in the last month) implemented a Poisson solver using Conjugate Gradient in cylindrical coordinates. Grid size: 128x512x128 (8 million elements). Standard central difference stencil. I get speedup of almost 30x against Fortran 90 code.
Speedup is achieved using linear textures. This saves on registers, increases occupancy. I run one kernel for matrix/vec multiplication, and then simply use cuBlas. Speed is about 30 ms per conjugate gradient iteration.

Gordon

More CSR like, but optimized for a common maximum number of entries per row.

I still have a few experiments to do in that regard, too though. Currently the format is float4 with .x = value high, .y = value low, .z = column, .w = last/previous last element

That probably is suboptimal, and especially due to symmetry it might make far more sense to use the z and w values in the diagonal entries for the values of the secondary diagonal, and I also have some doubts that the matrix itself really needs to use emulated double precision.

For performance reasons I am doing a two-step reduction, though the second step can be merged with the vector updates, with a slight performance loss - but in that case vector updates and MatVec can not be merged without quite some performance loss I think, the optimal grid configuration is too different.

And the reason why I use PCG is because OpenFOAM uses this as the “main” CG method and the original idea was to make GPU acceleration “plug-in” like, invisible to the user - though I now think this is not possible without sacrificing a lot of performance.

And just in case I say something that sounds stupid to you: keep in mind I have been at this only one month, including reading some existing, little documented code, so I still have quite a bit to learn :-)