For a high-quality double-single implementation based on FMA, addition/subtraction is actually more expensive than multiplication. Much of the double-single code out there takes shortcuts in addition/subtraction, leading to low-accuracy results when the operands are close in magnitude but of opposite sign, i.e., when subtractive cancellation occurs. In other words, it fails the programmer precisely in those situations where improved accuracy is needed.

If memory serves, with FMA support a double-single multiplication takes only about 8 instructions, while addition/subtraction takes about 20. I do not recall offhand the cost of division, sqrt, or rsqrt, and I have not worked through the details of any other double-single operations beyond those.

Why did I state that the switchover point versus native double is at about a 1:24 ratio? Because the code bloat caused by double-single also has some negative impact on performance: more registers are used, the instruction cache hit rate may decline, divergence becomes possible, and the code is more difficult for the compiler to optimize.

As you point out, compensated operations (sums, products, dot products, polynomial evaluation) can often provide some (or even most) of the benefit of a double-single implementation at an attractive fraction of the cost. See the recent thread on compensated operations for literature references:

https://devtalk.nvidia.com/default/topic/815711/cuda-programming-and-performance/observations-related-to-32-bit-floating-point-ops-and-64-bit-floating-point-ops/

But compensated algorithms apply only to some primitives, and they require analysis to determine where they need to be used.