I think you will find that this applies to floating-point division only, which also matches the example you provided. On x86, floating-point division is down to single-digit cycle execution times now. There should never have been a need to manually replace floating-point division by two with multiplication by 0.5, as compilers have been routinely applying that substitution for decades. It is safe for them to do so because 0.5 is exactly representable, so both forms produce bit-identical results.
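To illustrate the substitution, here is a minimal sketch in C. The function names are made up for this example. Since dividing by 2 and multiplying by 0.5 describe the same exact real value and each is rounded once, the two functions should return identical results for any non-NaN input, which is why a compiler is free to turn the first into the second.

```c
/* Hypothetical helper names, for illustration only. */

/* Division by two as one would naturally write it. */
static float half_by_div(float x)
{
    return x / 2.0f;
}

/* The equivalent multiplication. 0.5f is exactly representable in binary
   floating point, so this yields the same correctly rounded result. */
static float half_by_mul(float x)
{
    return x * 0.5f;
}
```

Comparing the generated assembly of both functions at any optimization level above `-O0` is an easy way to check what your particular compiler does.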

With the help of FMA, it is possible to accelerate floating-point division by constants other than powers of two, but given the high-speed FP divide on x86, that probably doesn’t make sense anymore. The technique is still applicable to GPUs, though you might have to apply it manually there.

[Later:] Checking Agner Fog’s instruction tables, it seems I slightly misremembered. For Skylake, they list the latency of FP division (VDIVPS) at 11 cycles, vs. 4 cycles for FP multiplication (VMULPS).