Suppose I have a kernel which computes, say, “a[i] = sin(x) + cos(x)”. Also, assume that there are as many SFUs as active threads in SM (say for example 1024), and that half of the threads inside a block do the above computation. My question: can sin and cos be computed in parallel, taking x cycles, instead of being computed sequentially and taking 2x cycles, since there are SFUs available to compute both in parallel? Thanks

[1] There is no dual issuing capability for the SFU. In a given cycle, each thread issues at most one SFU operation.

[2] There are fewer SFUs than normal arithmetic units that perform single-precision adds, multiplies, and FMAs. Per Table 3 in section 5.4.1 of the CUDA Programming Guide, the ratio is 1:4 for Turing and 1:8 for Ampere. This also means the assumption “there are as many SFUs as active threads in SM (say for example 1024)” does not normally hold.

[3] Depending on compilation switches, `sin(float)` and `cos(float)` may not use the SFU at all.
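To make point [3] concrete, here is a minimal sketch (not a benchmark) that computes the same expression two ways. As far as I know, without `-use_fast_math` the calls `sinf()` and `cosf()` compile to library routines built largely from ordinary FP32 and integer instructions, while the intrinsics `__sinf()` and `__cosf()` map to the hardware approximation units at reduced accuracy; with `-use_fast_math` the first kernel is treated like the second. The kernel names, the helper `max_abs_diff_fast_vs_accurate`, and the input range are my own choices for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Accurate path: sinf/cosf are library routines unless -use_fast_math is given.
__global__ void sum_accurate(const float* x, float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sinf(x[i]) + cosf(x[i]);
}

// Fast path: __sinf/__cosf are reduced-accuracy hardware approximations.
__global__ void sum_fast(const float* x, float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = __sinf(x[i]) + __cosf(x[i]);
}

// Hypothetical helper: runs both kernels on inputs in [-pi, pi) and
// returns the largest absolute difference between their results.
float max_abs_diff_fast_vs_accurate(int n) {
    std::vector<float> hx(n), h_acc(n), h_fast(n);
    for (int i = 0; i < n; ++i)
        hx[i] = -3.14159265f + 6.28318531f * (float)i / (float)n;

    float *dx, *d_acc, *d_fast;
    size_t bytes = (size_t)n * sizeof(float);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&d_acc, bytes);
    cudaMalloc(&d_fast, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sum_accurate<<<blocks, threads>>>(dx, d_acc, n);
    sum_fast<<<blocks, threads>>>(dx, d_fast, n);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(h_acc.data(), d_acc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fast.data(), d_fast, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(d_acc);
    cudaFree(d_fast);

    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i)
        max_diff = fmaxf(max_diff, fabsf(h_acc[i] - h_fast[i]));
    return max_diff;
}

int main() {
    printf("max |accurate - fast| = %g\n", max_abs_diff_fast_vs_accurate(1024));
    return 0;
}
```

Compiling once with plain `nvcc` and once with `nvcc -use_fast_math`, then comparing the generated machine code with `cuobjdump -sass`, should show which kernels actually contain SFU instructions. When both the sine and the cosine of the same argument are needed, `sincosf()` (or the intrinsic `__sincosf()`) is also worth trying.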
