For the Fermi architecture, it is given that:

For devices of compute capability 2.0, a multiprocessor consists of:

32 CUDA cores for integer and floating-point arithmetic operations,

4 special function units for single-precision floating-point transcendental functions,

2 warp schedulers

A warp scheduler can issue an instruction to only half of the CUDA cores. To execute an instruction for all threads of a warp, a warp scheduler must therefore issue the instruction over:

2 clock cycles for an integer or floating-point arithmetic instruction,

2 clock cycles for a double-precision floating-point arithmetic instruction,

8 clock cycles for a single-precision floating-point transcendental instruction.

Since there are only 4 special function units per multiprocessor to execute transcendental functions, and each warp scheduler can issue an instruction to only half of the units (16 of the 32 cores), I would expect a transcendental instruction to take 32/2 = 16 cycles, because only 2 of the 4 special function units would be used for each warp instruction.
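To make the arithmetic explicit, here is a minimal sketch of the model I have in mind. It assumes (my assumption, not something the quoted text states) that the issue time is simply the warp size of 32 threads divided by the number of units the instruction is issued to per cycle. The function name `issue_cycles` is made up for illustration.

```python
WARP_SIZE = 32  # threads per warp


def issue_cycles(units_per_cycle: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to cover a whole warp, assuming each cycle
    serves exactly `units_per_cycle` threads (simplified model)."""
    if units_per_cycle <= 0 or warp_size % units_per_cycle != 0:
        raise ValueError("units_per_cycle must be a positive divisor of warp_size")
    return warp_size // units_per_cycle


# Arithmetic instruction: one scheduler issues to 16 of the 32 CUDA cores.
arithmetic = issue_cycles(16)

# Transcendental instruction if all 4 special function units are used.
sfu_all_four = issue_cycles(4)

# Transcendental instruction if only 2 of the 4 units are used (my expectation).
sfu_only_two = issue_cycles(2)

print(arithmetic, sfu_all_four, sfu_only_two)
```

Under this model, the quoted 2-cycle and 8-cycle figures correspond to 16 cores and all 4 special function units respectively, while my 16-cycle figure corresponds to only 2 special function units per warp instruction.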

Is my understanding wrong?