It is given that
For devices of compute capability 2.0, a multiprocessor consists of:
32 CUDA cores for integer and floating-point arithmetic operations,
4 special function units for single-precision floating-point transcendental functions,
2 warp schedulers
A warp scheduler can issue an instruction to only half of the CUDA cores. To
execute an instruction for all threads of a warp, a warp scheduler must therefore
issue the instruction over:
2 clock cycles for an integer or floating-point arithmetic instruction,
2 clock cycles for a double-precision floating-point arithmetic instruction,
8 clock cycles for a single-precision floating-point transcendental instruction.
Given that there are only 4 special function units per multiprocessor and each warp scheduler can issue an instruction to only 16 cores, shouldn't the number of execution cycles for a transcendental instruction be 32/2 = 16 cycles, since only 2 of the 4 special function units would be used for each warp instruction?
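For concreteness, here is a minimal CUDA sketch (the kernel name and constants are made up for illustration, not taken from the thread) pairing the two instruction classes under discussion: one instruction that runs on the 32 CUDA cores and one that runs on the 4 special function units.

```
// Hypothetical kernel: __fmaf_rn belongs to the fp32 arithmetic class
// (executed on the CUDA cores), __sinf to the transcendental class
// (executed on the special function units).
__global__ void core_vs_sfu(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        // Fused multiply-add: fp32 arithmetic class, 2 cycles per warp
        // on compute capability 2.0 per the quoted guide text.
        float a = __fmaf_rn(x, 2.0f, 1.0f);
        // Fast single-precision sine: transcendental class, 8 cycles
        // per warp per the quoted guide text.
        out[i] = __sinf(a);
    }
}
```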
Is this correct? I thought that the warp schedulers partitioned themselves based on the phase of the warp number… one scheduler does even warps, one scheduler does odd warps.
One instruction scheduler is responsible for even warps, the other one for odd warps. The cores are not partitioned among the schedulers. It may help to convert the core counts to pipelines with certain instruction throughputs (as Nighthawk13 did). So, a Fermi GF100 (GTX 480, C2050, C2070, …) has fp32 pipelines capable of issuing 2 warps' worth of instructions in 2 ticks (the same holds for most int32 instructions), transcendental pipelines capable of issuing 1 warp's worth of instructions in 8 ticks, and so on. These are listed in Section 5.4.1 of the CUDA C Programming Guide.
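To make the arithmetic explicit, here is a back-of-the-envelope sketch (host-side, constants restating the compute capability 2.0 unit counts from the quoted guide text; nothing here is measured) of why a transcendental warp instruction takes 8 ticks rather than 16:

```
#include <cstdio>

int main()
{
    const int warp_size = 32;
    const int fp32_lanes_per_issue = 16; // each scheduler feeds half of the 32 cores
    const int sfu_units = 4;             // all 4 SFUs serve the issuing warp

    // 32 threads through 16 lanes -> 2 ticks per warp (2 warps in 2 ticks
    // across both schedulers, matching the pipeline view above).
    printf("fp32/int32: %d ticks per warp\n", warp_size / fp32_lanes_per_issue);

    // 32 threads through 4 SFUs -> 8 ticks per warp, not 32/2 = 16:
    // the SFUs are not split between the two schedulers.
    printf("transcendental: %d ticks per warp\n", warp_size / sfu_units);
    return 0;
}
```

The key point the sketch encodes is that only the issue bandwidth is split between the schedulers; the special function units themselves are a shared pipeline.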
Now, at run-time each scheduler issues an instruction from its pool of warps if: