How does a dispatcher, which can only dispatch 1 warp/cycle, fit with concurrent execution of the FP32 and INT datapaths? Are independent instructions from one warp executed at the same time?
Several sources, some more official than others, state that Volta cannot dual-issue:
Kepler-Pascal supported dual-issue but this counts as 1 cycle. On Volta-Turing architecture issue slot utilization is IPC / MaximumIPC or IPC/4.0 per SM.
https://arxiv.org/pdf/1804.06826.pdf (Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking)
On Volta there is only one dispatcher in a processing block, and we do not observe dual issue in the generated code.
Do I understand correctly from this post ("INT 32 and FP64 can be used concurrently in the Volta architecture? - #3 by nindanaoto") that the dispatch would be overlapping? In Volta/Turing, the FP32 and INT32 cores can each process 16 threads per cycle per SM partition, so a 32-thread warp instruction would occupy its datapath for 2 cycles.
So in cycles 0-1 an FP32 instruction is dispatched, in cycles 1-2 an INT32 instruction, in cycles 2-3 an FP32 instruction, and so on? In other words, one new instruction is dispatched each cycle, but a single dispatch can take longer than 1 cycle?
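To make the timeline I have in mind concrete, here is a small toy model in Python. It is only a sketch of my assumption, not a description of the real hardware: it assumes a single dispatcher that can start at most one instruction per cycle, and that each datapath handles 16 threads per cycle, so a 32-thread warp instruction keeps its datapath busy for 2 cycles. The function name `dispatch_timeline` and the constants are made up for illustration.

```python
WARP_SIZE = 32
LANES_PER_PIPE = 16  # assumed: FP32 and INT32 each process 16 threads/cycle per SM partition
CYCLES_PER_INSTR = WARP_SIZE // LANES_PER_PIPE  # 2 cycles per warp instruction


def dispatch_timeline(instrs):
    """Toy model (hypothetical): returns a list of (pipe, start, end) tuples.

    `start` is the cycle in which the instruction is dispatched and
    `end` is the first cycle in which its datapath is free again.
    """
    pipe_free = {}  # pipe name -> first cycle at which that datapath is free
    next_issue = 0  # earliest cycle in which the single dispatcher can issue
    timeline = []
    for pipe in instrs:
        # Wait for both the dispatcher and the target datapath.
        start = max(next_issue, pipe_free.get(pipe, 0))
        end = start + CYCLES_PER_INSTR
        timeline.append((pipe, start, end))
        pipe_free[pipe] = end
        next_issue = start + 1  # at most one new instruction per cycle
    return timeline


alternating = dispatch_timeline(["FP32", "INT32", "FP32", "INT32"])
fp32_only = dispatch_timeline(["FP32"] * 4)

for name, timeline in (("alternating", alternating), ("FP32 only", fp32_only)):
    print(name)
    for pipe, start, end in timeline:
        # Print inclusive cycle ranges, matching the "cycles 0-1" notation above.
        print(f"  {pipe:5s} cycles {start}-{end - 1}")
```

In this model the alternating stream starts a new instruction every cycle (FP32 in cycles 0-1, INT32 in cycles 1-2, and so on), while the FP32-only stream can only start one every second cycle because the datapath is still busy. Whether the real dispatcher behaves like this is exactly what I am asking.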