The double-precision arguments I pass to these functions are much smaller than the maximum stated before the slow path is used. Yet when I compile a kernel that uses double-precision trig functions with --ptxas-options=-v, it reports 40 bytes of local memory, which I would like to avoid because my arguments should never need the slow path.
Is this local memory usage real, or is it only reported because the compiler cannot know the magnitude of the arguments at compile time, while the path actually taken at run time is the fast path? How can I tell whether local memory is actually used during the computation?
Both the fast path and the slow path are present in the generated code, so the local memory allocated for the slow path shows up in the compiler statistics. However, as long as the magnitude of the function arguments does not require the slow path, this allocated local memory is never touched. I have yet to see a real-life application that exercises the slow path of the trig-function argument reduction, which is not to say such applications don't exist.
The CUDA driver pre-allocates a certain amount of local memory for every thread in a kernel (for example, for use by the ABI or for register spilling), so the mere presence of those 40 bytes' worth of local memory should not make any difference one way or the other. What prompted the addition of the cited passage to the Programming Guide was the misconception on the part of some CUDA users that any occurrence of local memory in the compiler's memory statistics for a kernel means something bad happened to the code and it will be slow.
If the sin() and cos() calls are of the form sin(π * [expr]) or cos(π * [expr]), I highly recommend using the functions sinpi() and cospi(). They have no slow path, meaning there is less code; they don't require local memory, use fewer registers, and provide increased accuracy compared to an explicit multiplication by π followed by a call to the corresponding "regular" trig function.