Solving the Poisson equation: what speedup? Migration to OpenCL

I have developed a specialized fluid-dynamics simulation using the approach called direct numerical simulation (DNS). DNS requires the fewest modeling assumptions but, until recently, was regarded as completely intractable. The bottleneck is solving the Poisson equation by relaxation.

The code was developed with Visual Studio on Windows 7. It is completely CPU-bound, not memory-intensive. To attack the Poisson solve I developed a multi-threaded, recursive multigrid algorithm, the so-called V-cycle, but it isn't fast enough. I run it on an AMD Phenom II at 3 GHz.
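In outline, my solver follows the textbook V-cycle. Here is a stripped-down 1-D sketch in plain Python, just to show the structure (illustrative names only; the real solver is multi-threaded and works on far larger grids):

```python
import math

def residual(u, f, h):
    """r = f - A u for the 1-D Poisson operator -u'' with Dirichlet ends."""
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def jacobi(u, f, h, sweeps, w=2.0 / 3.0):
    """Weighted-Jacobi smoother: a few cheap sweeps to damp high frequencies."""
    n = len(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            new[i] = (1.0 - w) * u[i] + w * (u[i - 1] + u[i + 1] + h * h * f[i]) / 2.0
        u = new
    return u

def restrict(r):
    """Full-weighting restriction: fine grid (2m+1 pts) -> coarse grid (m+1 pts)."""
    m = (len(r) - 1) // 2
    rc = [0.0] * (m + 1)
    for i in range(1, m):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc

def prolong(e, n_fine):
    """Linear-interpolation prolongation: coarse grid -> fine grid."""
    ef = [0.0] * n_fine
    for i in range(len(e)):
        ef[2 * i] = e[i]
    for i in range(1, n_fine - 1, 2):
        ef[i] = 0.5 * (ef[i - 1] + ef[i + 1])
    return ef

def v_cycle(u, f, h):
    """One recursive V-cycle for -u'' = f with zero Dirichlet boundaries."""
    if len(u) <= 3:                             # coarsest level: solve exactly
        u[1] = (u[0] + u[2] + h * h * f[1]) / 2.0
        return u
    u = jacobi(u, f, h, 3)                      # pre-smooth
    rc = restrict(residual(u, f, h))            # restrict the residual
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)  # coarse-grid correction
    u = [a + b for a, b in zip(u, prolong(ec, len(u)))]
    return jacobi(u, f, h, 3)                   # post-smooth
```

On a 65-point grid with f = pi^2 sin(pi x), a dozen or so V-cycles drive the error down to the discretization level. The recursion and the grid-transfer operators are where the multi-threaded partitioning gets painful.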

I first investigated AMD and bought a 4870X2 card, but their OpenCL platform seems stuck in beta, with loud complaints about poor performance. It is possible to get speed out of it if one is prepared to work hard close to the metal, but that seems like too much effort given the impending transition to OpenCL. The current NVIDIA environment is much more refined, but I've heard that a factor-of-10 improvement over a multi-threaded, multi-core CPU approach is questionable.

I have preordered a C2070 card, but it may not arrive for nine months.

Suggestions on what to do in the meantime would be appreciated. In single precision, what factor of improvement could be expected solving the Poisson equation by relaxation with multigrid, on a typical grid size of 20000? Perhaps with a GTX 295 card?

Giving exact speedup numbers is always difficult until you specify the algorithm and analyze each part, and your description above is not detailed enough to support even a guess.

That said, speedups of a factor of 10 are not unreasonable, and some specialized applications reach 50x. Other apps are no faster than the CPU at all; it all depends on the bottlenecks and your implementation strategy. There's a big list of examples here.

A final point: often you need to choose new algorithms that favor the GPU's strengths. For example, an irregular-mesh finite-difference solver can run on the GPU, and quickly, but it may be easier to switch to an algorithm that parallelizes more naturally, like Monte Carlo. MC has the big advantages of giving an unbiased answer and of making each sample fairly self-contained and independent, allowing embarrassingly parallel simulation: you simply merge the results from multiple threads, or even multiple GPUs, at the end.

Specifically, for some Poisson solvers on complex arbitrary domains, this is superior in both speed and accuracy to any gridded finite-element method. There is a simple random-walk formulation that is quite GPU-friendly, and a so-called "floating" random-walk method with even faster convergence, especially in arbitrary domains.
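To make the "floating" random-walk idea concrete, here is a minimal walk-on-spheres sketch for the homogeneous case (the Laplace equation) on the unit disk, in plain Python with illustrative names of my own choosing. Each walk is completely independent, which is exactly what makes the method embarrassingly parallel:

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, n_walks=20000, eps=1e-3):
    """Estimate the harmonic function u(x, y) inside the unit disk with the
    given boundary values, via the floating (walk-on-spheres) random walk.

    From the current point, jump to a uniform random point on the largest
    circle that fits inside the domain; repeat until within eps of the
    boundary, then score the boundary value at the nearest boundary point.
    """
    total = 0.0
    for _ in range(n_walks):
        px, py = x, y
        while True:
            d = 1.0 - math.hypot(px, py)      # distance to the unit circle
            if d < eps:
                break                          # close enough: absorb
            theta = random.uniform(0.0, 2.0 * math.pi)
            px += d * math.cos(theta)          # jump to the sphere surface
            py += d * math.sin(theta)
        r = math.hypot(px, py)                 # project to nearest boundary point
        total += boundary_value(px / r, py / r)
    return total / n_walks                     # unbiased up to the eps cutoff
```

With boundary data g(x, y) = x, the harmonic extension is u(x, y) = x exactly, so the estimate at (0.3, 0.4) should come out near 0.3, with statistical error shrinking like 1/sqrt(n_walks). Note this sketch covers only the homogeneous case; a nonzero Poisson source term adds a volume (Green's-function) contribution at each step of the walk.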

Thank you. I was hoping there was an algorithm easier to parallelize than the V-Cycle. Even multithreading, which requires complex partitioning of the grid between threads, has been difficult.

I understand that the C2050/C2070 cards are architecturally different from the current generation. What limitations of support does this imply for the current 200-series chips?

Yes, the new Fermi parts will be better in many ways than the G200 parts, but to be honest the details probably shouldn't affect your experiments or designs now. It's almost certainly easier to design an algorithm for the current cards (which will ALSO run on the upcoming C2050) than to design for hardware that's not even available yet. Historically this has held up well: G200 is more versatile than G80, but all G80 code runs quite well on G200, and even today, two years later, you usually try to write code that can run on both.

I get between 10x and 30x speedup for low-order FEM and MG, depending on the cycle type, the smoother (cyclic reduction for line relaxation obviously gives less speedup than Jacobi; SOR is in the middle), and the depth of the hierarchy. Most of the steps in an MG solver parallelise naturally.
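The smoother matters because of data dependences: Jacobi reads only the previous iterate, so every point can be updated concurrently, whereas plain Gauss-Seidel/SOR updates in place and serializes unless you recolour the grid. The usual compromise is a red-black ordering, where each colour is a fully parallel half-sweep. A minimal 2-D sketch in plain Python (illustrative only, not tuned code):

```python
def redblack_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for the 5-point stencil of -lap(u) = f.

    Points of the same colour share no neighbours, so within each half-sweep
    every update is independent; on a GPU each colour maps to one kernel launch.
    """
    n = len(u)
    for colour in (0, 1):                      # 0 = "red" points, 1 = "black"
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == colour:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1] +
                                      h * h * f[i][j])
    return u
```

Repeated sweeps converge like SOR with omega = 1; on a small Laplace problem with linear boundary data the interior relaxes to the exact (linear) discrete solution.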