A simplistic analysis of a technical code might describe it as “compute bound” or “memory bound”. SP (single precision) and DP (double precision) are categories of compute operations (although they may imply some level of memory utilization, they don’t have to).

If your code is not compute bound, then to a first-order estimate SP vs. DP should not matter. Of course, if a code is not compute bound in a particular setting, constraining the compute throughput enough will eventually make it compute bound. GPUs have particular ratios of SP to DP throughput. When the ratio is close to 1, that is a “fast DP” GPU; when it is far from 1, that is a “slow DP” GPU. Tesla K40 has a 3:1 ratio, whereas Titan Xp has a 32:1 ratio. Even though K40 is two generations older than Titan Xp, and will be slower than Titan Xp on many codes, on a truly compute-bound DP code it should be noticeably faster than Titan Xp.
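As a back-of-envelope illustration, the ratio tells you how far DP peak falls below SP peak. The SP numbers below are approximate published spec-sheet figures (actual throughput depends on clocks and the exact part, so treat them as illustrative only):

```python
# Rough peak-throughput comparison. The SP TFLOP/s figures are
# approximate spec-sheet numbers, used here only for illustration.
gpus = {
    # name: (peak SP TFLOP/s, SP:DP ratio)
    "Tesla K40": (4.29, 3),    # "fast DP" part
    "Titan Xp": (12.15, 32),   # "slow DP" part
}

for name, (sp_tflops, ratio) in gpus.items():
    dp_tflops = sp_tflops / ratio
    print(f"{name}: {sp_tflops:.2f} TFLOP/s SP, {dp_tflops:.2f} TFLOP/s DP")
```

Note that despite being two generations newer and far faster in SP, Titan Xp's DP peak works out to roughly a quarter of K40's.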

If you limit the scope of the code you are analyzing enough, it should be possible to get a fairly good compute-bound vs. memory-bound analysis using techniques such as roofline analysis. In my experience, though, most folks don’t tackle the question with that much rigor.
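The core of a roofline estimate is just a comparison of the kernel's arithmetic intensity (FLOPs per byte moved) against the machine's balance point. A minimal sketch, where the peak numbers are placeholders you would replace with your own GPU's specs:

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as compute- or memory-bound under the roofline model.

    flops       : floating-point operations performed by the kernel
    bytes_moved : bytes transferred to/from DRAM
    peak_flops  : machine peak throughput (FLOP/s)
    peak_bw     : machine peak memory bandwidth (bytes/s)
    """
    ai = flops / bytes_moved           # arithmetic intensity (FLOP/byte)
    balance = peak_flops / peak_bw     # machine balance point (FLOP/byte)
    attainable = min(peak_flops, ai * peak_bw)
    kind = "compute-bound" if ai >= balance else "memory-bound"
    return kind, attainable

# Example: a streaming kernel doing 2 DP FLOPs per 24 bytes of traffic,
# on a hypothetical GPU with 1.4 TFLOP/s DP peak and 288 GB/s bandwidth.
kind, perf = roofline_bound(flops=2, bytes_moved=24,
                            peak_flops=1.4e12, peak_bw=288e9)
print(kind, f"{perf / 1e9:.0f} GFLOP/s attainable")
# -> memory-bound 24 GFLOP/s attainable
```

With an arithmetic intensity of 2/24 FLOP/byte against a balance point of about 4.9, this kernel sits deep under the bandwidth roof: it can never use more than a small fraction of the machine's DP peak.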

The GPU profilers allow you to perform an experimental analysis that can give you a pretty good estimate of compute-boundedness vs. memory-boundedness. In fact, there are “utilization” metrics, scaled from 0 to 10, that can do much of the analysis for you.
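As an illustration of how such 0-to-10 metrics might be read (the pairing of a compute functional-unit utilization against a DRAM utilization follows nvprof-style metric naming; the thresholds below are my own rough rule of thumb, not an NVIDIA-documented cutoff):

```python
def interpret_utilization(fu_utilization, dram_utilization):
    """Rough read of profiler utilization metrics, each scaled 0-10.

    fu_utilization   : compute functional-unit utilization (0-10)
    dram_utilization : device-memory utilization (0-10)
    """
    HIGH = 7  # heuristic threshold -- a rule of thumb, not official guidance
    if fu_utilization >= HIGH and fu_utilization > dram_utilization:
        return "likely compute-bound"
    if dram_utilization >= HIGH and dram_utilization > fu_utilization:
        return "likely memory-bound"
    return "no clear limiter (look at latency, occupancy, divergence, ...)"

print(interpret_utilization(fu_utilization=3, dram_utilization=9))
# -> likely memory-bound
```

When neither metric is high, the kernel is often latency-limited rather than throughput-limited, which is a different tuning problem altogether.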

In the scientific and technical computing landscape, most codes are more memory bound than compute bound. Truly compute-bound codes are hard to find; matrix-matrix multiply is the canonical example. If you can realize your Navier-Stokes calculations as matrix-matrix multiplies (literally calling a matrix-matrix multiply library function), then you have a good candidate for an at least partially compute-bound code. Otherwise there is a good chance your code is memory bound. As I stated already, the experiment you ran does not show much indication of compute-boundedness, but since Titan Xp has more memory bandwidth than Tesla K40, it does show some indication of memory-boundedness.
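A quick way to see why matrix-matrix multiply is the canonical compute-bound example: its arithmetic intensity grows with problem size, whereas for most streaming kernels it is a small constant. A sketch for DP GEMM, counting only the idealized lower bound on DRAM traffic (each of the three N x N matrices crossing memory exactly once, which good tiled implementations approach):

```python
def gemm_arithmetic_intensity(n, bytes_per_word=8):
    """FLOP/byte for an n x n double-precision matrix-matrix multiply,
    assuming each matrix crosses DRAM exactly once (an idealized lower
    bound on traffic that well-blocked implementations approach)."""
    flops = 2 * n**3                          # n^3 multiply-add pairs
    bytes_moved = 3 * n**2 * bytes_per_word   # read A and B, write C
    return flops / bytes_moved

# Intensity scales as n/12 in DP, so it climbs without bound:
for n in (128, 1024, 8192):
    print(n, f"{gemm_arithmetic_intensity(n):.1f} FLOP/byte")
```

For large enough N, the intensity exceeds any GPU's balance point, which is what pushes GEMM under the compute roof.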

This is a very simplistic treatment, of course. Many disparate factors could be the actual limiter of an application run on a particular platform, such as the throughput of a particular instruction type, the machine’s behavior in the presence of branching or divergent code, or the bandwidth of a specific unit, such as the L2 cache.