Understanding instruction dispatching in Volta architecture

According to the Volta whitepaper, there are 4 processing blocks within an SM. Each processing block has 1 warp scheduler and 1 dispatch unit, plus 8 FP64 units, 16 INT units, 16 FP32 units, 8 LD/ST units, and 4 SFUs.

Here are some of my questions regarding instruction dispatching:

  • There’s only 1 dispatch unit, so we can’t exploit ILP within a warp?

  • A warp contains 32 threads. Does this mean that it takes 32 / 8 = 4 cycles to dispatch a double-precision instruction for all 32 threads in a warp, 32 / 16 = 2 cycles for integer or single-precision instructions, 32 / 8 = 4 cycles for load/store instructions, and 32 / 4 = 8 cycles for special-function instructions?

  • If we’re requesting coalesced memory (say, an ideal 1 transaction per request), can Volta do it on 1 LSU in 1 cycle?

By the way, I kept seeing “error code 15: this request was blocked by the security rules” when posting. I have no idea what is wrong with my post content. When I tried to reach Contact at https://developer.nvidia.com/contact, a 403 Forbidden occurred after sending the message.

https://docs.nvidia.com/cuda/volta-tuning-guide/index.html

1.4.1.1. Instruction Scheduling
Each Volta SM includes 4 warp-scheduler units. Each scheduler handles a static set of warps and issues to a dedicated set of arithmetic instruction units. Instructions are performed over two cycles, and the schedulers can issue independent instructions every cycle. Dependent instruction issue latency for core FMA math operations is reduced to four clock cycles, compared to six cycles on Pascal. As a result, execution latencies of core math operations can be hidden by as few as 4 warps per SM, assuming 4-way instruction-level parallelism (ILP) per warp. Many more warps are, of course, recommended to cover the much greater latency of memory transactions and control-flow operations.
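To spell out the arithmetic behind that last claim (my own back-of-the-envelope reading of the quoted numbers, not an official formula): each scheduler can issue one instruction per cycle, and a dependent FMA must wait four cycles, so a scheduler needs four independent instructions in flight to stay busy.

```latex
\[
\underbrace{1\ \tfrac{\text{instr}}{\text{cycle}}}_{\text{issue rate}}
\times
\underbrace{4\ \text{cycles}}_{\text{dependent FMA issue latency}}
= 4\ \text{independent instructions in flight per scheduler}
\]
\[
\frac{4\ \text{instructions}}{4\text{-way ILP per warp}} = 1\ \text{warp per scheduler}
\;\Rightarrow\; 4\ \text{warps per SM (one per scheduler)}
\]
```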

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput

5.4.1. Arithmetic Instructions
In this section, throughputs are given in number of operations per clock cycle per multiprocessor. For a warp size of 32, one instruction corresponds to 32 operations, so if N is the number of operations per clock cycle, the instruction throughput is N/32 instructions per clock cycle.

For 64-bit floating-point add, multiply, and multiply-add on CC 7.0, N = 32; note that N = 2 on CC 7.5.
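As a worked instance of the N/32 formula above (the split into sub-partitions is my own reading of the whitepaper figures, not from the guide): for FP64 on CC 7.0, N = 32 operations per clock per SM, which is consistent with 8 FP64 units in each of the 4 sub-partitions.

```latex
\[
\frac{N}{32} = \frac{32}{32} = 1\ \text{FP64 warp instruction per clock per SM}
\]
\[
\text{cycles to issue one warp instruction on one scheduler}
= \frac{32\ \text{threads}}{8\ \text{FP64 units}} = 4
\]
```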

Maybe you can try https://devtalk.nvidia.com/default/board/359/forum-feedback/

- There’s only 1 dispatch unit, so we can’t exploit ILP within a warp?

Instruction-level parallelism can also be achieved when a sequence of instructions has no dependencies between the instructions. An execution dependency would require a stall until the dependency is resolved; independent instructions can be issued back to back, even to the same pipeline.
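As a minimal sketch of that point (the kernel names and constants are mine, not from this thread): both kernels below push the same total number of FMAs through the same FP32 pipeline, but the second keeps 4 independent dependency chains in flight, so the scheduler never has to stall waiting for the previous result.

```cuda
#include <cuda_runtime.h>

// One long dependency chain: every FMA must wait for the previous result.
__global__ void dependent_chain(float *out, float x, int iters)
{
    float a = 1.0f;
    for (int i = 0; i < iters; ++i)
        a = fmaf(a, x, 1.0f);
    out[threadIdx.x] = a;
}

// Four independent chains: a new FMA can issue every cycle.
__global__ void independent_chains(float *out, float x, int iters)
{
    float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, x, 1.0f);
        b = fmaf(b, x, 1.0f);
        c = fmaf(c, x, 1.0f);
        d = fmaf(d, x, 1.0f);
    }
    out[threadIdx.x] = a + b + c + d;  // combine so no chain is optimized away
}

int main()
{
    float *out;
    cudaMalloc(&out, 32 * sizeof(float));
    // Same total FMA count per thread: 1<<20 versus 4 * (1<<18).
    dependent_chain<<<1, 32>>>(out, 1.0000001f, 1 << 20);
    independent_chains<<<1, 32>>>(out, 1.0000001f, 1 << 18);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Timing the two launches (with cudaEvent records or a profiler) should show the second finishing considerably faster, even though a single dispatch unit issues at most one instruction per cycle.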

- A warp contains 32 threads. Does this mean that it takes 32 / 8 = 4 cycles to dispatch a double-precision instruction for all 32 threads in a warp, 32 / 16 = 2 cycles for integer or single-precision instructions, 32 / 8 = 4 cycles for load/store instructions, and 32 / 4 = 8 cycles for special-function instructions?

Your statement is correct for most double-precision, FP32, integer, and special-function instructions, which are dispatched to execution units in the SM sub-partitions. Load/store instructions are issued to a unit that is shared by the 4 SM sub-partitions (Kepler through Volta, excluding GP100). Load/store instructions are dispatched to a FIFO for the shared unit.

- If we’re requesting coalesced memory (say, an ideal 1 transaction per request), can Volta do it on 1 LSU in 1 cycle?

The answer depends on the instruction and the GPU.

For a 32-bit shared or global load, the GV100 LSU tag unit can resolve 32 threads in 1 cycle. If the data is in the cache or in shared memory and there is no bank conflict, then the LSU can return 128 B/cycle. If the access is a 64-bit shared or global load, then it takes 2 cycles, as the return path is the limiter.
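To make the widths concrete, here is an illustrative pair of copy kernels (the names and sizes are mine): with a stride-1 access pattern, each warp of copy32 requests 32 × 4 B = 128 B, matching the 128 B/cycle return path quoted above, while each warp of copy64 requests 32 × 8 B = 256 B and so needs two cycles on the return path even though it is still a single coalesced request.

```cuda
#include <cuda_runtime.h>

__global__ void copy32(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // 4 B per thread, stride-1: 128 B per warp
}

__global__ void copy64(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // 8 B per thread, stride-1: 256 B per warp
}

int main()
{
    const int n = 1 << 20;
    float  *f_in, *f_out;
    double *d_in, *d_out;
    cudaMalloc(&f_in,  n * sizeof(float));
    cudaMalloc(&f_out, n * sizeof(float));
    cudaMalloc(&d_in,  n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    copy32<<<(n + 255) / 256, 256>>>(f_in, f_out, n);
    copy64<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(f_in); cudaFree(f_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```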

Thanks for your answers! I’m still confused about:

  • Even if a sequence of instructions has no dependencies, there’s only 1 dispatch unit. Is one dispatch unit capable of issuing multiple instructions in the same cycle to achieve ILP? Or is there some other mechanism for this?

  • So a load/store instruction is first issued to this shared unit and then executed in the sub-partition’s LSUs? Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

- Even if a sequence of instructions has no dependencies, there’s only 1 dispatch unit. Is one dispatch unit capable of issuing multiple instructions in the same cycle to achieve ILP? Or is there some other mechanism for this?

Pipelining independent instructions is a form of instruction-level parallelism. Multi-instruction dispatch is not a requirement for instruction-level parallelism.
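An illustrative issue timeline (my own, built from the 4-cycle dependent-issue latency quoted earlier): even with a single dispatch unit issuing one instruction per cycle, four independent FMAs overlap in the pipeline.

```latex
\[
\begin{array}{c|c|c}
\text{instruction} & \text{issue cycle} & \text{earliest dependent issue} \\
\hline
I_0 & 0 & 4 \\
I_1 & 1 & 5 \\
I_2 & 2 & 6 \\
I_3 & 3 & 7 \\
\end{array}
\]
```

By cycle 4 the result of I_0 can feed a dependent instruction, so the scheduler never stalls: that is ILP from pipelining, with no dual issue involved.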

- So a load/store instruction is first issued to this shared unit and then executed in the sub-partition’s LSUs? Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

The instruction is issued to an instruction queue (along with its registers and constants) and executed on the shared, SM-level execution unit. In the logical SM diagram in the whitepaper, the LD/ST boxes should be drawn closer to the TEX boxes: LSU and TEX are shared execution units that are time-sliced between the sub-partitions.

Is it possible for a memory instruction from sub-partition 1 to be executed in sub-partition 2?

No. The LSU is an SM-level shared execution unit.