To start, please read the article I linked above. In short, the compiler does not do this in the general case because the optimization does not produce a bitwise-identical result, and for the basic arithmetic operations (+, -, *, /) CUDA claims bitwise compliance with a proper IEEE-754 result for floating-point arithmetic.
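To make the "not bitwise identical" point concrete, here is a minimal host-side C++ sketch. It assumes the example in question divides by a constant such as 3.0f (an assumption on my part; substitute your own divisor), and the helper names bits_of, differs and count_mismatches are hypothetical, chosen only for illustration. It compares x / 3.0f against x * (1.0f / 3.0f) bit for bit, assuming ordinary IEEE-754 single-precision arithmetic and no fast-math flags on the host compiler:

```cpp
#include <cstdint>
#include <cstring>

// Return the raw IEEE-754 bit pattern of a float.
static std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// True when dividing by 3.0f and multiplying by the rounded reciprocal
// of 3.0f give different bit patterns for this x.
// The divisor 3.0f is an illustrative assumption, not taken from the original code.
bool differs(float x) {
    // volatile keeps the host compiler from folding or rewriting the operations
    volatile float divisor = 3.0f;
    volatile float reciprocal = 1.0f / 3.0f;
    volatile float quotient = x / divisor;     // correctly rounded division
    volatile float product = x * reciprocal;   // two roundings: reciprocal, then multiply
    return bits_of(quotient) != bits_of(product);
}

// Count how many of the integers 1..n give different results for the two forms.
int count_mismatches(int n) {
    int count = 0;
    for (int i = 1; i <= n; ++i) {
        if (differs(static_cast<float>(i))) {
            ++count;
        }
    }
    return count;
}
```

Calling differs(5.0f) from any small main reports a difference, while differs(3.0f) does not. The results are usually the same or one ulp apart, but "usually the same" is not what an IEEE-754 compliance claim allows, so the compiler leaves the division alone unless told otherwise.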

For the particular example you have shown, you can "enable" the compiler to make such an optimization. One possible method is to pass the --use_fast_math compiler switch to nvcc.

In general, it's unwise, in my opinion, to try to understand what is going on by studying PTX. Studying the SASS gives a better view, because PTX is not what the GPU executes: the PTX->SASS conversion goes through an optimizing compiler stage.

Therefore the effect of the above switch will be "less" evident at the PTX level, and "more" evident at the SASS level.

The switch results in producing PTX for your example that includes this alternate instruction:

div.approx.ftz.f32 ...

Studying the SASS, however, we see that the division routine has been replaced with a single multiply instruction: