It is given that
For devices of compute capability 2.0, a multiprocessor consists of:
32 CUDA cores for integer and floating-point arithmetic operations,
4 special function units for single-precision floating-point transcendental functions,
2 warp schedulers
A warp scheduler can issue an instruction to only half of the CUDA cores. To
execute an instruction for all threads of a warp, a warp scheduler must therefore
issue the instruction over:
2 clock cycles for an integer or floating-point arithmetic instruction,
2 clock cycles for a double-precision floating-point arithmetic instruction,
8 clock cycles for a single-precision floating-point transcendental instruction.
Given that there are only 4 special function units per multiprocessor and each warp scheduler can issue an instruction to only 16 cores, shouldn't the number of execution cycles for a transcendental instruction be 32/2 = 16 cycles, since only 2 of the 4 special function units would be used for each warp instruction?
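For concreteness, here is a minimal CUDA sketch (the kernel name and constants are made up for illustration, not taken from the thread) pairing the two instruction classes under discussion: one instruction that runs on the 32 CUDA cores and one that runs on the 4 special function units.

```
// Hypothetical kernel: __fmaf_rn belongs to the fp32 arithmetic class
// (executed on the CUDA cores), __sinf to the transcendental class
// (executed on the special function units).
__global__ void core_vs_sfu(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        // Fused multiply-add: fp32 arithmetic class, 2 cycles per warp
        // on compute capability 2.0 per the quoted guide text.
        float a = __fmaf_rn(x, 2.0f, 1.0f);
        // Fast single-precision sine: transcendental class, 8 cycles
        // per warp per the quoted guide text.
        out[i] = __sinf(a);
    }
}
```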
Is this correct? I thought that the warp schedulers partitioned themselves based on the phase of the warp number… one scheduler does even warps, one scheduler does odd warps.
One instruction scheduler is responsible for even warps, the other one for odd warps. The cores are not partitioned among the schedulers. It may help to convert the core counts to pipelines with certain instruction throughputs (as Nighthawk13 did). So, a Fermi GF100 (GTX 480, C2050, C2070, …) has fp32 pipelines capable of issuing 2 warps' worth of instructions in 2 ticks (the same holds for most int32 instructions), transcendental pipelines capable of issuing 1 warp's worth of instructions in 8 ticks, and so on. These are listed in Section 5.4.1 of the CUDA C Programming Guide.
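To make the arithmetic explicit, here is a back-of-the-envelope sketch (host-side, constants restating the compute capability 2.0 unit counts from the quoted guide text; nothing here is measured) of why a transcendental warp instruction takes 8 ticks rather than 16:

```
#include <cstdio>

int main()
{
    const int warp_size = 32;
    const int fp32_lanes_per_issue = 16; // each scheduler feeds half of the 32 cores
    const int sfu_units = 4;             // all 4 SFUs serve the issuing warp

    // 32 threads through 16 lanes -> 2 ticks per warp (2 warps in 2 ticks
    // across both schedulers, matching the pipeline view above).
    printf("fp32/int32: %d ticks per warp\n", warp_size / fp32_lanes_per_issue);

    // 32 threads through 4 SFUs -> 8 ticks per warp, not 32/2 = 16:
    // the SFUs are not split between the two schedulers.
    printf("transcendental: %d ticks per warp\n", warp_size / sfu_units);
    return 0;
}
```

The key point the sketch encodes is that only the issue bandwidth is split between the schedulers; the special function units themselves are a shared pipeline.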
Now, at run-time each scheduler issues an instruction from its pool of warps if: