Understanding instruction dispatching in Volta architecture

According to the Volta whitepaper, there are 4 processing blocks within an SM. Each processing block has 1 warp scheduler and 1 dispatch unit, plus 8 FP64 units, 16 INT units, 16 FP32 units, 8 LD/ST units, and 4 SFUs.

Here are some of my questions regarding instruction dispatching:

  • There’s only 1 dispatch unit, so we can’t exploit ILP within a warp?

  • A warp contains 32 threads. Does this mean that it takes 32 / 8 = 4 cycles to dispatch a double-precision instruction for all 32 threads in a warp, 32 / 16 = 2 cycles for integer or single-precision instructions, 32 / 8 = 4 cycles for load/store instructions, and 32 / 4 = 8 cycles for special-function instructions?

  • If we’re requesting coalesced memory (say, an ideal 1 transaction per request), can Volta do it on 1 LSU in 1 cycle?

By the way, I kept seeing “error code 15: this request was blocked by the security rules” when posting. I have no idea what is wrong with my post content. When I tried to reach Contact at https://developer.nvidia.com/contact, a 403 Forbidden occurred after sending the message.

https://docs.nvidia.com/cuda/volta-tuning-guide/index.html

1.4.1.1. Instruction Scheduling
Each Volta SM includes 4 warp-scheduler units. Each scheduler handles a static set of warps and issues to a dedicated set of arithmetic instruction units. Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent instruction issue latency for core FMA math operations is reduced to four clock cycles, compared to six cycles on Pascal. As a result, execution latencies of core math operations can be hidden by as few as 4 warps per SM, assuming 4-way instruction-level parallelism (ILP) per warp. Many more warps are, of course, recommended to cover the much greater latency of memory transactions and control-flow operations.
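To spell out the arithmetic behind that last claim (my own back-of-the-envelope reading of the quoted numbers, not an official formula): each scheduler can issue one instruction per cycle, and a dependent FMA must wait four cycles, so a scheduler needs four independent instructions in flight to stay busy.

```latex
\[
\underbrace{1\ \tfrac{\text{instr}}{\text{cycle}}}_{\text{issue rate}}
\times
\underbrace{4\ \text{cycles}}_{\text{dependent FMA issue latency}}
= 4\ \text{independent instructions in flight per scheduler}
\]
\[
\frac{4\ \text{instructions}}{4\text{-way ILP per warp}} = 1\ \text{warp per scheduler}
\;\Rightarrow\; 4\ \text{warps per SM (one per scheduler)}
\]
```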

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

5.4.1. Arithmetic Instructions
In this section, throughputs are given in number of operations per clock cycle per multiprocessor. For a warp size of 32, one instruction corresponds to 32 operations, so if N is the number of operations per clock cycle, the instruction throughput is N/32 instructions per clock cycle.

For 64-bit floating-point add, multiply, and multiply-add on CC 7.0, N = 32; note that N = 2 on CC 7.5.
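As a worked instance of the N/32 formula above (the split into sub-partitions is my own reading of the whitepaper figures, not from the guide): for FP64 on CC 7.0, N = 32 operations per clock per SM, which is consistent with 8 FP64 units in each of the 4 sub-partitions.

```latex
\[
\frac{N}{32} = \frac{32}{32} = 1\ \text{FP64 warp instruction per clock per SM}
\]
\[
\text{cycles to issue one warp instruction on one scheduler}
= \frac{32\ \text{threads}}{8\ \text{FP64 units}} = 4
\]
```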

Maybe you can try https://devtalk.nvidia.com/default/board/359/forum-feedback/

- There’s only 1 dispatch unit, so we can’t exploit ILP within a warp?

Instruction-level parallelism can also be achieved when a sequence of instructions has no dependencies between the instructions. An execution dependency would require a stall until the dependency is resolved; independent instructions can be issued back to back, even to the same pipeline.
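As a minimal sketch of that point (the kernel names and constants are mine, not from this thread): both kernels below push the same total number of FMAs through the same FP32 pipeline, but the second keeps 4 independent dependency chains in flight, so the scheduler never has to stall waiting for the previous result.

```cuda
#include <cuda_runtime.h>

// One long dependency chain: every FMA must wait for the previous result.
__global__ void dependent_chain(float *out, float x, int iters)
{
    float a = 1.0f;
    for (int i = 0; i < iters; ++i)
        a = fmaf(a, x, 1.0f);
    out[threadIdx.x] = a;
}

// Four independent chains: a new FMA can issue every cycle.
__global__ void independent_chains(float *out, float x, int iters)
{
    float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, x, 1.0f);
        b = fmaf(b, x, 1.0f);
        c = fmaf(c, x, 1.0f);
        d = fmaf(d, x, 1.0f);
    }
    out[threadIdx.x] = a + b + c + d;  // combine so no chain is optimized away
}

int main()
{
    float *out;
    cudaMalloc(&out, 32 * sizeof(float));
    // Same total FMA count per thread: 1<<20 versus 4 * (1<<18).
    dependent_chain<<<1, 32>>>(out, 1.0000001f, 1 << 20);
    independent_chains<<<1, 32>>>(out, 1.0000001f, 1 << 18);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Timing the two launches (with cudaEvent records or a profiler) should show the second finishing considerably faster, even though a single dispatch unit issues at most one instruction per cycle.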

- A warp contains 32 threads. Does this mean that it takes 32 / 8 = 4 cycles to dispatch a double-precision instruction for all 32 threads in a warp, 32 / 16 = 2 cycles for integer or single-precision instructions, 32 / 8 = 4 cycles for load/store instructions, and 32 / 4 = 8 cycles for special-function instructions?

Your statement is correct for most double-precision, FP32, integer, and special-function instructions, which are dispatched to execution units in the SM sub-partitions. Load/store instructions are issued to a unit that is shared by the 4 SM sub-partitions (Kepler through Volta, excluding GP100). Load/store instructions are dispatched to a FIFO for the shared unit.

- If we’re requesting coalesced memory (say, an ideal 1 transaction per request), can Volta do it on 1 LSU in 1 cycle?

The answer depends on the instruction and the GPU.

For a 32-bit shared or global load, the GV100 LSU tag unit can resolve 32 threads in 1 cycle. If the data is in the cache or in shared memory and there is no bank conflict, then the LSU can return 128 B/cycle. If the access is a 64-bit shared or global load, then it takes 2 cycles, as the return path is the limiter.
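To make the widths concrete, here is an illustrative pair of copy kernels (the names and sizes are mine): with a stride-1 access pattern, each warp of copy32 requests 32 × 4 B = 128 B, matching the 128 B/cycle return path quoted above, while each warp of copy64 requests 32 × 8 B = 256 B and so needs two cycles on the return path even though it is still a single coalesced request.

```cuda
#include <cuda_runtime.h>

__global__ void copy32(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // 4 B per thread, stride-1: 128 B per warp
}

__global__ void copy64(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // 8 B per thread, stride-1: 256 B per warp
}

int main()
{
    const int n = 1 << 20;
    float  *f_in, *f_out;
    double *d_in, *d_out;
    cudaMalloc(&f_in,  n * sizeof(float));
    cudaMalloc(&f_out, n * sizeof(float));
    cudaMalloc(&d_in,  n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    copy32<<<(n + 255) / 256, 256>>>(f_in, f_out, n);
    copy64<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(f_in); cudaFree(f_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```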

Thanks for your answers! I’m still confused about:

  • Even if a sequence of instructions has no dependencies, there’s only 1 dispatch unit. Is one dispatch unit capable of issuing multiple instructions in the same cycle to achieve ILP? Or is there some other mechanism for this?

  • So a load/store instruction is first issued to this shared unit and then executed in the sub-partition’s LSUs? Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

- Even if a sequence of instructions has no dependencies, there’s only 1 dispatch unit. Is one dispatch unit capable of issuing multiple instructions in the same cycle to achieve ILP? Or is there some other mechanism for this?

Pipelining independent instructions is a form of instruction-level parallelism. Multi-instruction dispatch is not a requirement for instruction-level parallelism.
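An illustrative issue timeline (my own, built from the 4-cycle dependent-issue latency quoted earlier): even with a single dispatch unit issuing one instruction per cycle, four independent FMAs overlap in the pipeline.

```latex
\[
\begin{array}{c|c|c}
\text{instruction} & \text{issue cycle} & \text{earliest dependent issue} \\
\hline
I_0 & 0 & 4 \\
I_1 & 1 & 5 \\
I_2 & 2 & 6 \\
I_3 & 3 & 7 \\
\end{array}
\]
```

By cycle 4 the result of I_0 can feed a dependent instruction, so the scheduler never stalls: that is ILP from pipelining, with no dual issue involved.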

- So a load/store instruction is first issued to this shared unit and then executed in the sub-partition’s LSUs? Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

The instruction is issued to an instruction queue (along with its registers and constants) and executed on the shared, SM-level execution unit. In the logical SM diagram in the whitepaper, the LD/ST boxes should be drawn closer to the TEX boxes: LSU and TEX are shared execution units that are time-sliced between the sub-partitions.

Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

No. The LSU is an SM-level shared execution unit.