Why does the CUDA C Best Practices Guide still insist that low occupancy leads to performance degradation? (p. 45)
On p. 51 of the CUDA C Best Practices Guide, it says:
“The compiler optimizes 1.0f/sqrtf(x) into rsqrtf() only when this does not violate IEEE-754 semantics.”
Isn’t the opposite correct?
Another mistake:
“One of the key differences is the fused multiply-add (FMAD) instruction, which combines multiply-add operations into a single instruction execution and truncates the intermediate result of the multiplication.” (p. 60 of the CUDA C Best Practices Guide)
FMA does not truncate the intermediate result of the multiplication. The product is kept at full precision and the result is rounded only once, after the addition.
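To make the point about the intermediate product concrete, here is a minimal sketch (not taken from the guide) that computes a*b + c two ways on the device. The kernel name `fma_demo`, the helper `run_fma_demo`, and the input values are my own choices for illustration. `__fmul_rn` and `__fadd_rn` are the CUDA intrinsics that the compiler never contracts into an FMA, and `fmaf` is the fused version. It assumes a device with real FMA support (compute capability 2.0 or later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Computes a*b + c two ways and stores both results.
__global__ void fma_demo(float a, float b, float c, float *out)
{
    // Separate multiply and add, each rounded to nearest.
    // The _rn intrinsics are never merged into a single FMA.
    out[0] = __fadd_rn(__fmul_rn(a, b), c);
    // Fused multiply-add: exact product, one rounding after the add.
    out[1] = fmaf(a, b, c);
}

// Hypothetical host helper: runs the kernel once and copies both results back.
bool run_fma_demo(float a, float b, float c, float result[2])
{
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;
    fma_demo<<<1, 1>>>(a, b, c, d_out);
    // cudaMemcpy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(result, d_out, 2 * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    // a*a = 1 + 2^-11 + 2^-24 exactly; the 2^-24 term does not fit in a float,
    // so a separately rounded product loses it.
    const float a = 1.0f + 1.0f / 4096.0f;     // 1 + 2^-12
    const float c = -(1.0f + 1.0f / 2048.0f);  // -(1 + 2^-11)

    float r[2];
    if (!run_fma_demo(a, a, c, r)) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("separate mul+add: %g\n", r[0]);
    printf("fused fmaf:       %g\n", r[1]);
    return 0;
}
```

With these inputs the separate version should give 0 and the fused version 2^-24 (about 5.96e-8), which can only happen if the low bits of the product survive until the addition.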
I think this is the paragraph being discussed.
I think occupancy is a useful guide, but only one of several things to consider. A good thing to check first, before digging deeper.
In this forum, one of the very experienced CUDA programmers said something like “you don’t usually get a benefit from occupancy higher than 50%”.
Elsewhere I have seen that under extreme conditions it is possible to get nearly maximum performance at 4% occupancy.
Personally >> I think designing just for high occupancy is a bad trap. Similarly, designing to use every byte of the 16 KB of shared memory is a trap, and spending hours trying to get the fastest possible design is a trap (and leads to code that is hard to understand and maintain) <<
Cheers
The key really is hiding latencies through parallelism, which does not have to be at the thread level. Counting only on thread-level parallelism is a mistake.
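As a rough sketch of what parallelism within a thread can look like, the kernel below has each thread handle several independent elements, so several loads can be in flight per thread and fewer resident threads are needed to cover memory latency. The names `scale_ilp` and `run_scale_ilp` and the value `ILP = 4` are assumptions for illustration, not something from the guide, and whether this actually beats a one-element-per-thread kernel depends on the device and should be measured.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int ILP = 4;  // independent elements per thread (illustrative choice)

__global__ void scale_ilp(const float *in, float *out, float k, int n)
{
    // Each block covers blockDim.x * ILP elements; accesses stay coalesced.
    int base = blockIdx.x * blockDim.x * ILP + threadIdx.x;
    float v[ILP];

    // Issue all loads first; they do not depend on each other.
    #pragma unroll
    for (int i = 0; i < ILP; ++i) {
        int idx = base + i * blockDim.x;
        v[i] = (idx < n) ? in[idx] : 0.0f;
    }
    // Then do the arithmetic and the stores.
    #pragma unroll
    for (int i = 0; i < ILP; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n) out[idx] = k * v[i];
    }
}

// Hypothetical host helper: returns the number of wrong outputs, or -1 on a CUDA error.
int run_scale_ilp(int n, float k)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1000);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int per_block = threads * ILP;
    const int blocks = (n + per_block - 1) / per_block;
    scale_ilp<<<blocks, threads>>>(d_in, d_out, k, n);

    // cudaMemcpy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != k * h_in[i]) ++wrong;
    return wrong;
}

int main()
{
    int wrong = run_scale_ilp(1 << 20, 2.0f);
    printf("mismatches: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

The trade-off is registers: holding `ILP` values per thread raises register use and may lower occupancy, which is exactly the case where occupancy alone stops being a good predictor of performance.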