SP vs DP question regarding Navier-Stokes equation solver


I have an in-house C code for GPUs that solves the Navier-Stokes equations with detailed chemistry, and I am trying to figure out which new GPU to buy.

The problem is that I need double precision. But comparing a Tesla K40 and a Titan Xp, I do not see any difference in performance for my code. Actually, the Titan Xp is slightly faster (approximately 16%).

But looking at published specs (e.g. on Wikipedia), one can notice a ~4-5x difference between them in DP operations.

What is the reason for this? Is it a memory bottleneck?

nvvp / cudaprof and Nsight could analyze and report that for you…

No one here would know for sure what it is, as your code is in-house and not public.

Just because a code uses or needs DP doesn’t necessarily mean that is the performance limiter for that code (all recent GPUs provide some level of DP capability). In fact if your test results are an accurate reflection of code behavior, I would say it does not appear to be the performance limiter for that code.

I am curious how that need has been established. Temperatures, pressures, coefficients for reactions between species, etc are typically not known with an accuracy that requires more than single precision. Is the issue accumulated rounding error over lengthy computations? Is the problem with the limited range of single-precision data (hard to imagine that any physical parameters would exceed it)?

I am sometimes puzzled by scientific applications that use double precision throughout, except that a key function is implemented only with an error bound of 1e-5 (or some such) “for performance reasons”.

There are known techniques for error-compensated computation at key points in a computation, and some of them may be “free”, since the code in question is dominated by memory accesses in the first place. Using single precision instead of double precision cuts down memory traffic.

I have done some analysis of the code. A lot of time is spent on memory operations (reading data, writing new data, reading again, etc.).
But this was done just with timers. I think a more detailed investigation is needed, as cbucner1 suggests.

Concerning double precision utilization.

The code is meant for fundamental scientific research, which means the results will be analysed and used further. For example, when investigating the development of some flow instability, it is necessary to compute it as precisely as you can in order to find the linear growth of the instability, compare it with analytical linear theory, and confirm it or not.

And yes, I know of cases like the ones njuffa described :-). I try to avoid that by all means.

If I figure out why there is so little difference for my code, I will let you know. I am just curious: what are the real cases where the double-precision advantage of a GPU is decisive?

Maybe for Navier-Stokes equations it is a well-known fact that double-precision performance is key to fast computations and I need to seriously rework my code. Or maybe it is fine that I do not observe big differences in performance despite the differences in specifications.

A simplistic analysis of a technical code might describe it as “compute bound” or “memory bound”. SP and DP are categories of compute operations (although of course they may imply some level of memory utilization, they don’t really have to).

If your code is not compute bound, then to a first-order estimate SP vs. DP does not/should not matter. Of course, if a code is not compute bound in a particular setting, and you constrain the compute throughput enough, it will eventually become compute bound. GPUs have particular ratios of SP to DP throughput. When the ratio is close to 1, that is a “fast DP” GPU. When the ratio is far from 1, that is a “slow DP” GPU. Tesla K40 has a 3:1 ratio, whereas Titan Xp has a 32:1 ratio. Even though the K40 is two generations older than the Titan Xp, and will be slower than the Titan Xp on many codes, on a truly compute-bound DP code it should be noticeably faster than the Titan Xp.

If you limit the scope of the code you are analyzing enough, it should be possible to get a fairly good analysis of compute-bound vs. memory-bound using techniques such as roofline analysis. Most folks in my experience don’t tackle the question with that much rigor, however.

The GPU profilers can allow you to follow an experimental analysis which can give you a pretty good estimate of compute-boundedness vs. memory-boundedness. In fact there are “utilization” metrics, scaled from 0 to 10, which can do much of the analysis for you.

In the scientific and technical computing landscape, most codes are more memory-bound than compute-bound. Truly compute-bound codes are hard to find; matrix-matrix multiply is the canonical example. If you can realize your Navier-Stokes calculations as matrix-matrix multiplies (literally calling a matrix-matrix multiply library function) then you have a good candidate for an at least partially compute-bound code. Otherwise there is a good chance your code is memory bound, and as I stated already, the experiment you ran does not show much indication of compute-boundedness; but since the Titan Xp has more memory bandwidth than the Tesla K40, it does show some indication of memory-boundedness.

This is a very simplistic treatment, of course. Many disparate factors could be the actual limiter of an application when run on a particular platform, such as the throughput of a particular instruction type, or the capability of the machine in the presence of branching or divergent code, or the specific bandwidth of a specific unit, such as the L2 cache.