Throughputs of the 64-bit sine and cosine instructions

Hi.
Are the throughputs of the 64-bit sine and cosine instructions available somewhere, as their 32-bit counterparts are here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions?
I’m interested in compute capabilities 7.x.

Not that I know of. AFAIK these are not single hardware instructions but library routines built from sequences of simpler instructions. Their throughput is therefore not directly established by the hardware architecture, but by the nature of the code that is written/provided to implement them.

For standard math library functions, you would want to measure the throughput. The code is non-trivial, and execution speed will differ based on the throughput of multiple instruction classes, multi-issue capabilities of the hardware, internal branching, and potential constant memory or local memory accesses.
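A minimal measurement sketch along those lines is below. The array size, block configuration, and input range are illustrative assumptions, not recommendations:

```cuda
// Sketch of a throughput measurement for double-precision sin() on the GPU.
// Compile with, e.g.:  nvcc -arch=sm_70 sin_tput.cu
#include <cstdio>
#include <cstdlib>
#include <cmath>

__global__ void sin_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sin(in[i]);
}

int main(void)
{
    const int n = 1 << 24;
    double *h_in = (double *)malloc(n * sizeof(double));
    // The input distribution should reflect the intended use case; [0, 2*pi)
    // is used here purely as an example.
    for (int i = 0; i < n; i++)
        h_in[i] = 6.283185307179586 * ((double)rand() / RAND_MAX);

    double *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    sin_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // warm-up launch
    cudaEventRecord(start);
    sin_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("sin(): %.1f million results/second\n", n / (ms * 1.0e3));

    cudaFree(d_in); cudaFree(d_out); free(h_in);
    return 0;
}
```

Note that the measured time includes the kernel's global memory traffic; for a function as expensive as double-precision sin that overhead is usually small, but an inner per-thread loop can be added to amortize it further.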

For consumer GPUs with low FP64 throughput, double-precision math functions will typically be bottlenecked on that, so one can get a reasonable rough estimate by simply counting the number of FP64 instructions on the most likely execution path (from disassembled SASS; cuobjdump --dump-sass) and dividing FP64 throughput by that. This obviously does not apply to cc 7.0 / sm_70 (V100), which has high FP64 throughput.
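For the counting step, something like the following pipeline can be used; the object file name is a placeholder, and the opcode list covers the common double-precision arithmetic instructions:

```shell
# Dump the machine code and count double-precision arithmetic instructions
# (DADD, DMUL, DFMA); kernel.cubin is a placeholder name.
cuobjdump --dump-sass kernel.cubin | grep -cE 'DADD|DMUL|DFMA'
# Rough estimate: results/sec ~ (GPU's FP64 instructions/sec) / (count above)
```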

For sin and cos as well as various other math functions, execution time can be data dependent, so one would want to measure using a distribution of input arguments that roughly reflects the use case of interest.
