PTX is a virtual instruction set that exposes little beyond the instructions supported by GPU hardware. There are some exceptions for operations that are commonly present as instructions on other compute platforms, such as integer and floating-point division: these exist as instructions at the PTX level, but are really implemented as emulation routines "under the hood".

GPU hardware provides minimal hardware support for the following higher single-precision operations: reciprocal, reciprocal square root, sine, cosine, exponentiation base 2, logarithm base 2. These are exposed via PTX. CUDA offers device function intrinsics [such as __sinf(), __cosf()] which are thin wrappers around these PTX instructions. If CUDA code is built with -use_fast_math, some math library functions [such as sinf() and cosf()] are mapped automatically to the corresponding intrinsics. From your description above it sounds like this may be how you are building your code?
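A minimal sketch of the difference (the kernel itself is made up for illustration; the behavior of __sinf() vs. sinf() and of -use_fast_math is as described above):

```cuda
// Hypothetical kernel contrasting the intrinsic with the math library function.
__global__ void sines(const float *x, float *fast, float *accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __sinf(x[i]); // always the hardware approximation
        accurate[i] = sinf(x[i]);   // math library version; when compiled with
                                    // -use_fast_math this is silently mapped
                                    // to __sinf(x[i]) as well
    }
}
```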

You can find the supported PTX instructions in the document ptx_isa_4.1.pdf that ships with CUDA. For your purposes, you would want to consult section 8.7.3 Floating-point instructions. For example, the PTX instruction “sin” is described in sub-section 8.7.3.18 with the following synopsis:

sin.approx{.ftz}.f32 d, a;

As can be seen, there is no double-precision version of this instruction (since no such hardware instruction exists in the GPU).
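If desired, one can also emit this instruction directly from CUDA C code via inline PTX; a quick sketch (the wrapper name is made up, but the asm statement is the standard inline-PTX syntax):

```cuda
// Hypothetical wrapper that emits sin.approx.f32 via inline PTX;
// this is essentially what __sinf() maps to.
__device__ __forceinline__ float my_fast_sinf(float x)
{
    float y;
    asm ("sin.approx.f32 %0, %1;" : "=f"(y) : "f"(x));
    return y;
}
```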

Generally, the single-precision hardware implementations mentioned above are very high performance but "quick & dirty", since they were designed for use in graphics. Comprehensive math libraries for general computation obviously require many more functions, and also typically need higher accuracy and better special-case handling as prescribed by the IEEE-754 floating-point standard and the ISO C/C++ standards. Note also that the hardware does not provide any kind of higher double-precision operations.

Like just about any other computing platform, including x86 and ARM, CUDA therefore ships with a math library that sits on top of the assembly language level (i.e. upstream of PTX) in the software stack. In CUDA 6.5, the math library is provided as part of a device library. The documentation for this device library resides in a file called libdevice-users-guide.pdf that ships with CUDA. The actual code is in multiple files libdevice.compute_??.??.bc. As best I know, these libraries are usable by tool chains other than CUDA, and I believe there is at least one project which makes use of that.

Here is a presentation from GTC 2013 that shows how GPU compilers are structured. Slide 11 shows where the contents of libdevice enter the flow inside the tool chain, well before the PTX assembly code is generated:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3185-Building-GPU-Compilers-libNVVM.pdf