(1) For code analysis, always look at the machine code (SASS), not PTX. PTX is a virtual ISA and compiler intermediate representation, which an optimizing compiler (ptxas) then compiles into SASS. You can look at SASS by disassembling CUDA-generated executables with
(2) rcp.rn.f32 and sqrt.rn.f32 are correctly rounded (in the IEEE-754 sense) reciprocal and square root operations in PTX. They have no direct hardware equivalent, so you will see a longish sequence of SASS instructions generated for each of them.
(3) FP32 special function units exist in all GPUs supported by CUDA. They are actually called MUFU (multi-function unit), so in SASS you would be looking for operations like MUFU.EX2, MUFU.RCP, etc. In Pascal and later architectures, an approximate square root was added to the previously existing set of MUFU operations.
(4) The CUDA compiler defaults to IEEE-754 compliant basic FP32 operations and high-accuracy implementations of FP32 math functions. To get faster approximate versions of some of them, use device function intrinsics such as __log2f(), compiler switches such as -prec-sqrt=false, -prec-div=false, or -use_fast_math, or a combination of both. Consult the documentation, in particular the Best Practices Guide, the Programming Guide, and the nvcc documentation.
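
To make points (2) through (4) concrete, here is a minimal sketch of a test kernel you could compile and then search for in the SASS dump. The kernel name, the helper max_rel_diff_log2(), the file name demo.cu, and the input values are all made up for illustration. The instruction names mentioned in the comments are what I would expect to see, not a guarantee, since the generated code depends on the architecture and the toolkit version.

```cuda
// Build (file name demo.cu is an assumption):  nvcc -o demo demo.cu
// Inspect the machine code:                    cuobjdump --dump-sass demo
// Then rebuild with -use_fast_math and compare the two SASS listings.
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical demo kernel: each statement is something to locate in SASS.
__global__ void fp32_variants(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    out[0 * n + i] = sqrtf(v);      // default build: correctly rounded, expect a multi-instruction sequence
    out[1 * n + i] = __frcp_rn(v);  // round-to-nearest reciprocal, also expanded into a sequence
    out[2 * n + i] = log2f(v);      // accurate math library version
    out[3 * n + i] = __log2f(v);    // fast intrinsic, typically a single MUFU operation plus little else
}

// Runs the kernel on n inputs and returns the largest relative difference
// between log2f() and __log2f(). Returns -1.0f if any CUDA call fails.
float max_rel_diff_log2(int n)
{
    std::vector<float> h_in(n), h_out(4 * static_cast<size_t>(n));
    for (int i = 0; i < n; ++i) h_in[i] = 1.5f + 0.25f * i;  // keeps log2(v) away from zero

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, 4 * n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    fp32_variants<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also reports errors from the kernel launch above.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, 4 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float ref = h_out[2 * n + i];
        float approx = h_out[3 * n + i];
        worst = fmaxf(worst, fabsf(approx - ref) / fabsf(ref));
    }
    return worst;
}

int main()
{
    float d = max_rel_diff_log2(256);
    if (d < 0.0f) {
        printf("CUDA error\n");
        return 1;
    }
    printf("max relative difference log2f vs __log2f: %g\n", d);
    return 0;
}
```

With a default build, the sqrtf() and __frcp_rn() lines should each turn into a run of several instructions, while the __log2f() line should be much shorter. With -use_fast_math, log2f() is mapped to the intrinsic as well, so the reported difference should drop to zero and the two code sequences should look alike.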