Occupancy vs. performance: question on the CUDA C Best Practices Guide

Why does the CUDA C Best Practices Guide still insist that low occupancy leads to performance degradation? (p. 45)

On p. 51, the CUDA C Best Practices Guide says:
“The compiler optimizes 1.0f/sqrtf(x) into rsqrtf() only when this does not violate IEEE-754 semantics.”

Isn’t the opposite correct?

Another mistake:
“One of the key differences is the fused multiply-add (FMAD) instruction, which combines multiply-add operations into a single instruction execution and truncates the intermediate result of the multiplication.” (p. 60, CUDA C Best Practices Guide)

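A fused multiply-add (FMA) does not truncate the intermediate result of the multiplication. It keeps the full-precision product and rounds only once, after the addition.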
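To make the single-rounding point concrete, here is a minimal host-side C++ sketch (not CUDA device code, and not taken from the guide). The helper names `mul_then_add` and `fused` are made up for illustration. It assumes IEEE-754 single precision and a compiler that is not run with fast-math style flags.

```cpp
#include <cmath>

// Multiply, round the product to float, then add: two roundings in total.
// The volatile store forces the product to be rounded to float and keeps
// the compiler from contracting the expression into a fused operation.
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;  // low-order bits of the product are lost here
    return product + c;
}

// Fused multiply-add: the exact product a*b is added to c, and the result
// is rounded once at the end (std::fma with float arguments returns float).
float fused(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-13 and c = -(1 + 2^-12), the exact value of a*a + c is 2^-26. The fused version returns that value, while the two-step version returns 0, because the 2^-26 term is discarded when the product is rounded to float.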

I think this is the paragraph being discussed.

I think occupancy is a useful guide, but only one of several things to consider. A good thing to check first, before digging deeper.
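For anyone who wants to do that first check by hand, here is a rough C++ sketch of how occupancy falls out of a kernel's resource use. It is a simplified model, not the official occupancy calculator: it ignores register and shared memory allocation granularity, and the struct names, field names, and any limits you pass in are my own assumptions. Real limits depend on the device's compute capability.

```cpp
#include <algorithm>

// Per-multiprocessor limits, supplied by the caller (hypothetical struct).
struct SmLimits {
    int max_warps;     // maximum resident warps
    int max_blocks;    // maximum resident blocks
    int registers;     // 32-bit registers available
    int shared_bytes;  // shared memory available, in bytes
    int warp_size;     // threads per warp
};

// Resources used by one kernel launch configuration (hypothetical struct).
struct KernelUse {
    int threads_per_block;
    int regs_per_thread;
    int shared_bytes_per_block;
};

// Occupancy = resident warps / maximum resident warps, where the number of
// resident blocks is capped by whichever resource runs out first.
double occupancy(const SmLimits& sm, const KernelUse& k) {
    int warps_per_block = (k.threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int by_warps = sm.max_warps / warps_per_block;
    int by_regs = k.regs_per_thread > 0
        ? sm.registers / (k.regs_per_thread * warps_per_block * sm.warp_size)
        : sm.max_blocks;
    int by_shared = k.shared_bytes_per_block > 0
        ? sm.shared_bytes / k.shared_bytes_per_block
        : sm.max_blocks;
    int blocks = std::min({sm.max_blocks, by_warps, by_regs, by_shared});
    return static_cast<double>(blocks * warps_per_block) / sm.max_warps;
}
```

As an illustration with made-up limits of 32 warps, 8 blocks, 16384 registers and 16 KB of shared memory: a 256-thread block using 16 registers per thread and 4 KB of shared memory gives 100% occupancy, and doubling the register use to 32 per thread halves it to 50%.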

In this forum, one of the very experienced CUDA programmers said something like “you don’t usually get a benefit from occupancy higher than 50%”.

Elsewhere I have seen that under extreme conditions it is possible to get nearly maximum performance at 4% occupancy.

Personally, I think designing just for high occupancy is a bad trap. Similarly, designing to use every byte of the 16k of shared memory is a trap, and spending hours trying to get the fastest possible design is a trap (and leads to code that is hard to understand and maintain).

Cheers

The key really is hiding latency through parallelism, which does not have to be at the thread level; independent instructions within a single thread also keep the pipeline busy. Counting only on thread-level parallelism is a mistake.
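A back-of-the-envelope way to see this is Little's law: the number of independent operations that must be in flight equals latency times throughput, and those operations can come from more warps or from more independent instructions per warp. The C++ sketch below is only that arithmetic. The function name `warps_needed` and its parameters are hypothetical, and the numbers you feed it are assumptions, not measured values for any particular GPU.

```cpp
// Rough estimate of how many resident warps are needed to hide latency.
//   latency_cycles  : cycles before an instruction's result is available
//   issue_per_cycle : warp-instructions the multiprocessor can issue per cycle
//   ilp             : independent instructions each warp has ready to issue
int warps_needed(int latency_cycles, int issue_per_cycle, int ilp) {
    // Little's law: work in flight = latency * throughput.
    int in_flight = latency_cycles * issue_per_cycle;
    // Each warp contributes `ilp` independent instructions; round up.
    return (in_flight + ilp - 1) / ilp;
}
```

For an assumed latency of 24 cycles and one issue per cycle, this gives 24 warps when each warp has a single dependent chain (ilp = 1), but only 6 warps with four independent instructions per warp. That is how a kernel can run well at low occupancy.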