Throughputs of the 64-bit sine and cosine instructions

Hi.
Are the throughputs of the 64-bit sine and cosine instructions available somewhere, as their 32-bit counterparts are here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions?
I’m interested in compute capabilities 7.x.

Not that I know of. AFAIK these are not single hardware instructions but library routines built from sequences of simpler instructions. Their throughput is therefore not directly established by the hardware architecture, but by the nature of the code that is written/provided to implement them.

For standard math library functions, you would want to measure the throughput. The code is non-trivial, and execution speed will differ based on the throughput of multiple instruction classes, multi-issue capabilities of the hardware, internal branching, and potential constant memory or local memory accesses.
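A minimal measurement sketch along those lines is below. The array size, block configuration, and input range are illustrative assumptions, not recommendations:

```cuda
// Sketch of a throughput measurement for double-precision sin() on the GPU.
// Compile with, e.g.:  nvcc -arch=sm_70 sin_tput.cu
#include <cstdio>
#include <cstdlib>
#include <cmath>

__global__ void sin_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sin(in[i]);
}

int main(void)
{
    const int n = 1 << 24;
    double *h_in = (double *)malloc(n * sizeof(double));
    // The input distribution should reflect the intended use case; [0, 2*pi)
    // is used here purely as an example.
    for (int i = 0; i < n; i++)
        h_in[i] = 6.283185307179586 * ((double)rand() / RAND_MAX);

    double *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    sin_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // warm-up launch
    cudaEventRecord(start);
    sin_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("sin(): %.1f million results/second\n", n / (ms * 1.0e3));

    cudaFree(d_in); cudaFree(d_out); free(h_in);
    return 0;
}
```

Note that the measured time includes the kernel's global memory traffic; for a function as expensive as double-precision sin that overhead is usually small, but an inner per-thread loop can be added to amortize it further.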

For consumer GPUs with low FP64 throughput, double-precision math functions will typically be bottlenecked on that, so one can get a reasonable rough estimate by simply counting the number of FP64 instructions on the most likely execution path (from disassembled SASS; cuobjdump --dump-sass) and dividing FP64 throughput by that. This obviously does not apply to cc 7.0 / sm_70 (V100), which has high FP64 throughput.
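For the counting step, something like the following pipeline can be used; the object file name is a placeholder, and the opcode list covers the common double-precision arithmetic instructions:

```shell
# Dump the machine code and count double-precision arithmetic instructions
# (DADD, DMUL, DFMA); kernel.cubin is a placeholder name.
cuobjdump --dump-sass kernel.cubin | grep -cE 'DADD|DMUL|DFMA'
# Rough estimate: results/sec ~ (GPU's FP64 instructions/sec) / (count above)
```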

For sin and cos as well as various other math functions, execution time can be data dependent, so one would want to measure using a distribution of input arguments that roughly reflects the use case of interest.
