Implicit vs. explicit solvers

Hi there,

I am looking at a CFD simulation.

Is it right that CUDA works much better for explicit solvers, because the reduced inter-node communication suits the huge number of lower-power cores in a CUDA device?

Is there a way of reducing this issue for implicit solvers?

cheers

I wouldn’t say “much better”, but it is true that explicit integrators are easier to implement efficiently in CUDA than implicit solvers. That said, there is a lot of literature showing that CUDA can be used effectively to accelerate implicit solvers, especially iterative solutions of large sparse systems, which can leverage matrix-vector product operations. I think it is fair to say that implicit integrators are the weapon of choice only when the characteristics of the problem demand them, but that applies equally to other parallel computing environments.
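For concreteness, the workhorse of those iterative methods (CG, BiCGstab and friends) is the sparse matrix-vector product, and that maps onto CUDA quite naturally. A minimal, untested sketch of a CSR SpMV kernel with one thread per row might look like this (row_ptr, col_idx, vals, x and y are assumed to be device arrays you have already set up):

    // Sketch: CSR sparse matrix-vector product y = A*x, one thread per row.
    // Assumes row_ptr (n_rows+1 entries), col_idx and vals (nnz entries each),
    // x and y are already in device memory.
    __global__ void spmv_csr(int n_rows,
                             const int   *row_ptr,
                             const int   *col_idx,
                             const float *vals,
                             const float *x,
                             float       *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n_rows) {
            float sum = 0.0f;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += vals[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }

The rest of a Krylov solver is then mostly dot products and axpy-style updates, which is why these methods port reasonably well.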

Thanks for your quick answer,

Can you recommend any free papers or literature about CUDA and implicit solvers?

cheers

No, but this might be a useful start. I would also point you towards the CUSP template library, which has a lot of building blocks for implicit solvers available, including CG and BiCGstab routines. Dominik Göddeke and his colleagues have also published a series of excellent papers on using CUDA in multigrid schemes for FEM solvers, which you might want to look at if your problems are amenable to that sort of method.
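To give you an idea of how little code an implicit solve takes with CUSP, the quick-start example that ships with the library looks roughly like this (it assembles a 2D Poisson test problem and solves it with CG on the device; header names may differ slightly between versions, so treat this as a sketch):

    #include <cusp/hyb_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/gallery/poisson.h>
    #include <cusp/krylov/cg.h>

    int main(void)
    {
        // sparse matrix on the device (HYB format)
        cusp::hyb_matrix<int, float, cusp::device_memory> A;

        // build a 5-point 2D Poisson test problem
        cusp::gallery::poisson5pt(A, 256, 256);

        // solution x (initial guess 0) and right-hand side b
        cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
        cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

        // solve A*x = b with the Conjugate Gradient method on the GPU
        cusp::krylov::cg(A, x, b);

        return 0;
    }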

I would suggest that discontinuous Galerkin (DG) methods are the approach to look into.

See for example this paper: http://portal.acm.org/citation.cfm?id=1613429

and this presentation: http://www.cfm.brown.edu/people/jansh/page…SAHOM09-GPU.pdf

They just gave a talk at GTC about an extension of this work to the Navier-Stokes equations. See http://developer.download.nvidia.com/compu…10_Archives.htm, presentation 2078 by Timothy Warburton.
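If it helps to see why DG fits the hardware so well: most of the flops are small dense matrix-vector products local to one element, so you can hand each element to a thread block and keep its data in shared memory. A toy, untested sketch of such a volume kernel (NP nodes per element and the NP x NP reference-element operator D are just placeholder names here):

    // Toy DG "volume" kernel: apply a small dense NP x NP operator D to each
    // element's NP nodal values. One block per element, one thread per node.
    // Launch as: dg_volume<<<num_elems, NP>>>(D, u, du);
    #define NP 16   // nodes per element (illustrative value)

    __global__ void dg_volume(const float *D,   // NP*NP, row-major, shared by all elements
                              const float *u,   // num_elems * NP nodal values
                              float       *du)  // output, same layout as u
    {
        __shared__ float Ds[NP * NP];
        __shared__ float ue[NP];

        int e = blockIdx.x;    // element index
        int i = threadIdx.x;   // local node index

        // cooperatively stage the operator and this element's nodal values
        for (int k = i; k < NP * NP; k += blockDim.x)
            Ds[k] = D[k];
        ue[i] = u[e * NP + i];
        __syncthreads();

        float sum = 0.0f;
        for (int j = 0; j < NP; ++j)
            sum += Ds[i * NP + j] * ue[j];
        du[e * NP + i] = sum;
    }

Inside a time step each element only needs its own nodes plus face data from its neighbours for the fluxes, which is exactly the low-communication pattern the original question was about.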

DG is indeed an approach that is very well suited to massive parallelism, and the same is true for the LBM (Lattice-Boltzmann method). In the realm of “visual accuracy”, a.k.a. computer graphics, even simpler approaches like SPH have been demonstrated to fly on GPUs.

The major question however is (no flamewar intended): Does the hardware dictate the numerics or vice versa? We are currently working on Q2~ elements combined with a special multigrid preconditioning technique in an implicit approach, and preliminary results indicate that even such high-end numerical schemes can run well on GPUs. I’m even inclined to claim that tuning for GPUs is not much harder than tuning for CPUs, once you’ve gotten past the point that “standard” compilers generally produce “bad” code…
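For readers who have not met multigrid on GPUs before: the cycles are assembled from very simple kernels, such as smoothers, restriction and prolongation. Purely as a generic illustration (this is not the Q2~/multigrid scheme mentioned above), a damped Jacobi sweep for a 5-point Poisson stencil could be sketched as:

    // Generic damped-Jacobi smoother sweep for -Laplace(u) = f discretized with
    // a 5-point stencil on an nx x ny grid with spacing h. Just an example of
    // the kind of building block a GPU multigrid cycle is made of.
    __global__ void jacobi_smooth(int nx, int ny, float h,
                                  const float *u, const float *f, float *u_new,
                                  float omega)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = j * nx + i;
            float u_jac = 0.25f * (u[idx - 1] + u[idx + 1] +
                                   u[idx - nx] + u[idx + nx] +
                                   h * h * f[idx]);
            u_new[idx] = (1.0f - omega) * u[idx] + omega * u_jac;
        }
    }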

I’d be inclined to argue that hardware and numerics dictate each other. For the sake of simplicity, let’s consider a feature first introduced in ye olde tymes: floating point arithmetic. Back in the day, an FPU cost enough, in terms of silicon area etc., that including it or not made a real difference. As far as numerics went, you’d be foolish to try to use an algorithm that needed extensive floating point arithmetic on a system without native support. However, as time went on, the utility of the FPU proved great enough to justify dedicated hardware. Of course, it also helped that miniaturization meant the cost of adding the unit was not as great as before.

In the modern era of many-core processors, the cost/benefit war has returned in the GPU space, where the question becomes “Should we add a new piece of functionality that may or may not accelerate a given problem, or just add 5% more cores?” The answer lies in how many problems that functionality helps, and by how much, compared to its cost. But this is determined by numerics: find a new algorithm that is both applicable to many problems and can be significantly sped up by a certain function, and you suddenly have a very compelling case for adding that functionality to future hardware. In the meantime, it is smart to choose your algorithms according to which ones work best on existing hardware.

Anyway, a modern GPU is functionally more or less the same as a huge cluster of generic CPUs. The big question for whether an algorithm will perform well on it is whether it can be broken into a great many small pieces that mostly avoid stepping on each others’ toes. Tuning for GPUs can actually be easier than for CPUs, since they’re simpler overall. More importantly, the SIMT programming model allows “standard” compilers to produce code that makes “good” use of the vector hardware without resorting to pseudo-assembler and black magic. On the flip side, getting good performance out of the memory subsystem requires more black magic on GPUs than on CPUs, due to GPUs’ lack of a large cache.
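To make that last point concrete, here is a toy pair of kernels that do the same amount of work but access memory very differently; timing them side by side shows the kind of gap the “black magic” is about (stride is just an illustrative parameter):

    // Coalesced version: consecutive threads read consecutive addresses, so
    // each warp's loads combine into a few wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided version: consecutive threads read addresses far apart, so the
    // same work needs many more transactions and effective bandwidth collapses.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            size_t j = ((size_t)i * stride) % n;   // scattered, non-coalesced pattern
            out[i] = in[j];
        }
    }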
