Suppose I have a kernel that computes, say, “a[i] = sin(x) + cos(x)”. Also assume that there are as many SFUs as active threads in SM (say for example 1024). Half of the threads inside a block do the above computation. My question: can sin and cos be computed in parallel, taking x cycles, instead of sequentially, taking 2x cycles, since there are SFUs available to compute both in parallel? Thanks
 There is no dual issuing capability for the SFU. In a given cycle, each thread issues at most one SFU operation.
 There are fewer SFUs than normal arithmetic units that perform single-precision adds, multiplies, and FMAs. Per Table 3 in section 5.4.1 of the CUDA Programming Guide, the ratio is 1:4 for Turing and 1:8 for Ampere. This also means the assumption “there are as many SFUs as active threads in SM (say for example 1024)” does not normally hold.
 Depending on compilation switches, cos(float) may not use the SFU at all.
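
To make the point about compilation switches concrete, here is a minimal sketch (not taken from the original question; the kernel names, the file name sfu_demo.cu, and the input range are my own assumptions). It contrasts the standard math functions sinf()/cosf() with the intrinsics __sinf()/__cosf(). As far as I know, the intrinsics are the ones that map to the hardware approximations executed by the SFU, while the standard functions, when built without -use_fast_math, usually go through a more accurate software path built mostly from ordinary FMAs.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Standard-accuracy version. Without -use_fast_math, sinf()/cosf() are
// typically NOT implemented with the SFU approximations.
__global__ void sin_plus_cos(const float* x, float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sinf(x[i]) + cosf(x[i]);
}

// Fast version. The __sinf()/__cosf() intrinsics request the reduced-accuracy
// hardware approximations; the two operations are still issued one after
// the other within each thread.
__global__ void sin_plus_cos_fast(const float* x, float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = __sinf(x[i]) + __cosf(x[i]);
}

int main()
{
    const int n = 1024;  // hypothetical problem size
    float *x, *a, *a_fast;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&a_fast, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 0.001f * i;  // small arguments only

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    sin_plus_cos<<<blocks, threads>>>(x, a, n);
    sin_plus_cos_fast<<<blocks, threads>>>(x, a_fast, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Report how far the fast path deviates from the standard path.
    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i)
        max_diff = fmaxf(max_diff, fabsf(a[i] - a_fast[i]));
    printf("max |standard - fast| = %g\n", max_diff);

    cudaFree(x);
    cudaFree(a);
    cudaFree(a_fast);
    return 0;
}
```

Building this with "nvcc -o sfu_demo sfu_demo.cu" and inspecting the machine code with "cuobjdump -sass sfu_demo" should show which kernel contains MUFU.SIN / MUFU.COS instructions, i.e. the SFU operations. Adding -use_fast_math to the nvcc command line makes the first kernel use the intrinsics as well, so the two kernels should then look essentially the same.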