Solving the Poisson equation: what speedup? Migration to OpenCL

I have developed a specialized fluid-dynamics simulation using the approach called direct numerical simulation (DNS). DNS requires the fewest modeling assumptions but, until recently, was regarded as completely intractable. The bottleneck is solving the Poisson equation by relaxation.

The code was developed with Visual Studio on Windows 7. It is completely CPU-bound, not memory-intensive. To attack the Poisson solve I developed a multi-threaded, recursive multigrid algorithm, the so-called V-cycle, but it isn't fast enough. I run it on an AMD Phenom II at 3 GHz.
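In outline, my solver follows the textbook V-cycle. Here is a stripped-down 1-D sketch in plain Python, just to show the structure (illustrative names only; the real solver is multi-threaded and works on far larger grids):

```python
import math

def residual(u, f, h):
    """r = f - A u for the 1-D Poisson operator -u'' with Dirichlet ends."""
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def jacobi(u, f, h, sweeps, w=2.0 / 3.0):
    """Weighted-Jacobi smoother: a few cheap sweeps to damp high frequencies."""
    n = len(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            new[i] = (1.0 - w) * u[i] + w * (u[i - 1] + u[i + 1] + h * h * f[i]) / 2.0
        u = new
    return u

def restrict(r):
    """Full-weighting restriction: fine grid (2m+1 pts) -> coarse grid (m+1 pts)."""
    m = (len(r) - 1) // 2
    rc = [0.0] * (m + 1)
    for i in range(1, m):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolong(e, n_fine):
    """Linear-interpolation prolongation: coarse grid -> fine grid."""
    ef = [0.0] * n_fine
    for i in range(len(e)):
        ef[2 * i] = e[i]
    for i in range(1, n_fine - 1, 2):
        ef[i] = 0.5 * (ef[i - 1] + ef[i + 1])
    return ef

def v_cycle(u, f, h):
    """One recursive V-cycle for -u'' = f with zero Dirichlet boundaries."""
    if len(u) <= 3:                             # coarsest level: solve exactly
        u[1] = (u[0] + u[2] + h * h * f[1]) / 2.0
        return u
    u = jacobi(u, f, h, 3)                      # pre-smooth
    rc = restrict(residual(u, f, h))            # restrict the residual
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)  # coarse-grid correction
    u = [a + b for a, b in zip(u, prolong(ec, len(u)))]
    return jacobi(u, f, h, 3)                   # post-smooth
```

On a 65-point grid with f = pi^2 sin(pi x), a dozen or so V-cycles drive the error down to the discretization level. The recursion and the grid-transfer operators are where the multi-threaded partitioning gets painful.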

I first investigated AMD and bought a 4870X2 card, but their OpenCL platform seems stuck in beta, with loud complaints about poor performance. It is possible to get speed out of it if one is prepared to work hard close to the metal, but that seems like too much effort given the impending transition to OpenCL. The current NVIDIA environment is much more refined, but I've heard that a factor-of-10 improvement over a multi-threaded, multi-core CPU approach is questionable.

I have preordered a C2070 card, but it may not arrive for nine months.

Suggestions on what to do in the meantime would be appreciated. In single precision, what factor of improvement could be expected solving the Poisson equation by relaxation with multigrid, on a typical grid size of 20000? Perhaps with a GTX 295 card?

Giving exact speedup numbers is always difficult until you specify the algorithm and analyze each part, and your description above is not detailed enough to support even a guess.

That said, speedups of a factor of 10 are not unreasonable, and some specialized applications reach 50x. Other apps are no faster than the CPU at all; it all depends on the bottlenecks and your implementation strategy. There's a big list of examples here.

A final point: often you need to choose new algorithms that favor the GPU's strengths. For example, an irregular-mesh finite-difference solver can run on the GPU, and quickly, but it may be easier to switch to an algorithm that parallelizes more naturally, like Monte Carlo. MC has the big advantages of giving an unbiased answer and of making each sample fairly self-contained and independent, allowing embarrassingly parallel simulation: you simply merge the results from multiple threads, or even multiple GPUs, at the end.

Specifically, for some Poisson solvers on complex arbitrary domains, this is superior in both speed and accuracy to any gridded finite-element method. There is a simple random-walk formulation that is quite GPU-friendly, and a so-called "floating" random-walk method with even faster convergence, especially in arbitrary domains.
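To make the "floating" random-walk idea concrete, here is a minimal walk-on-spheres sketch for the homogeneous case (the Laplace equation) on the unit disk, in plain Python with illustrative names of my own choosing. Each walk is completely independent, which is exactly what makes the method embarrassingly parallel:

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, n_walks=20000, eps=1e-3):
    """Estimate the harmonic function u(x, y) inside the unit disk with the
    given boundary values, via the floating (walk-on-spheres) random walk.

    From the current point, jump to a uniform random point on the largest
    circle that fits inside the domain; repeat until within eps of the
    boundary, then score the boundary value at the nearest boundary point.
    """
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            d = 1.0 - math.hypot(px, py)      # distance to the unit circle
            if d < eps:
                break                          # close enough: absorb
            theta = random.uniform(0.0, 2.0 * math.pi)
            px += d * math.cos(theta)          # jump to the sphere surface
            py += d * math.sin(theta)
        r = math.hypot(px, py)                 # project to nearest boundary point
        total += boundary_value(px / r, py / r)
    return total / n_walks                     # unbiased up to the eps cutoff
```

With boundary data g(x, y) = x, the harmonic extension is u(x, y) = x exactly, so the estimate at (0.3, 0.4) should come out near 0.3, with statistical error shrinking like 1/sqrt(n_walks). Note this sketch covers only the homogeneous case; a nonzero Poisson source term adds a volume (Green's-function) contribution at each step of the walk.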

Thank you. I was hoping there was an algorithm easier to parallelize than the V-Cycle. Even multithreading, which requires complex partitioning of the grid between threads, has been difficult.

I understand that the C2050/C2070 cards are architecturally different from the current generation. What limitations of support does this imply for the current 200-series chips?

Yes, the new Fermi parts will be better in many ways than the G200 parts, but to be honest the details probably shouldn't affect your experiments or designs now. It's almost certainly easier to design an algorithm for the current cards (which will ALSO run on the upcoming C2050) than to design for hardware that's not even available yet. Historically this has held up well: G200 is more versatile than G80, but all G80 code runs quite well on G200, and even today, two years later, you usually try to write code that can run on both.

I get between 10x and 30x speedup for low-order FEM and MG, depending on the cycle type, the smoother (cyclic reduction for line relaxation obviously gives less speedup than Jacobi; SOR is in the middle), and the depth of the hierarchy. Most of the steps in an MG solver parallelise naturally.
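The smoother matters because of data dependences: Jacobi reads only the previous iterate, so every point can be updated concurrently, whereas plain Gauss-Seidel/SOR updates in place and serializes unless you recolour the grid. The usual compromise is a red-black ordering, where each colour is a fully parallel half-sweep. A minimal 2-D sketch in plain Python (illustrative only, not tuned code):

```python
def redblack_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for the 5-point stencil of -lap(u) = f.

    Points of the same colour share no neighbours, so within each half-sweep
    every update is independent; on a GPU each colour maps to one kernel launch.
    """
    n = len(u)
    for colour in (0, 1):                      # 0 = "red" points, 1 = "black"
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == colour:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1] +
                                      h * h * f[i][j])
    return u
```

Repeated sweeps converge like SOR with omega = 1; on a small Laplace problem with linear boundary data the interior relaxes to the exact (linear) discrete solution.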