Clarifying the process of issuing instructions on CUDA devices

Hello,

I would like to make sure I have a proper understanding of how instructions are issued. In the A100 whitepaper, each sub-partition has a warp scheduler which is said to issue 32 instructions at once (I assume from the same warp). Each instruction ends up in the appropriate pipeline. I have a couple of specific questions:

  1. No instruction type seems to have enough slots to receive 32 instructions at once (FP32 has 16 and FP64 has 8). Does that mean they are “queued” in the pipeline, so that on the next cycle the scheduler can process another ready instruction in another pipeline? Or will the scheduler issue a half/quarter warp in one cycle and the rest in the next? If it’s the latter, then I’m confused about the 32 thread/clk figure claimed by the whitepaper.

  2. I’m trying to get an idea of the theoretical IPC I should be able to achieve, to guide my profiling. If I understand correctly, the A100 SM has a throughput of 64 inst/clk for INT32. Even without taking advantage of the other pipelines, I should get an IPC of 64. However, for one kernel I profile I get 3 inst/cycle, which sounds very low. Yet at the same time, Nsight Compute reports 80% utilization of the INT32 pipeline and high compute throughput. How can I reconcile these two contradictory pieces of information?

Thanks for the help

  1. This is a question that gets asked from time to time (examples: 1, 2). A whitepaper will often describe the SM as a whole. So if the SM has 64 functional units that handle FP32, for example, and is broken into 4 SMSPs, each SMSP will have a throughput of 16 threads/clk, whereas the SM as a whole will have a throughput of 64 threads/clk.

  2. I commonly state that “in CUDA, all instructions are issued warp-wide.” Therefore I suspect that the confusion here arises because you are counting a warp issue of an instruction as 32, whereas the tools and other places may be counting it as 1. Example:

  • Instructions Executed — Number of times the source (instruction) was executed per individual warp, independent of the number of participating threads within each warp.
  • Thread Instructions Executed — Number of times the source (instruction) was executed by any thread, regardless of predicate presence or evaluation.

Unit 3 of this online training series may also be of interest.

Thank you very much, point 2 is now clear to me.

However, for point 1, I don’t think things add up yet. In https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, page 22, we see the diagram for an SM. There are clearly 4 sub-partitions, and each partition has a warp scheduler that can issue 32 inst/clk. But associated with this scheduler, there are only 16 INT32 pipelines. Where do the other 16 instructions go? Do they wait in some queue? Or does the full warp go into a single pipeline, with each pipeline simply having a very high latency?

Thanks for the great support!

As I mentioned in the two linked examples I provided, if the warp scheduler needs to issue an instruction to a set of functional units whose number is less than 32, the issue will take place over multiple clock cycles.

Not all functional units are fully enumerated in pictorial diagrams, nor is a complete description given in any whitepaper that I know of. But AFAIK, for example, the LD/ST unit can accept up to a full warp per clock. Therefore, for some instructions, it is possible for the warp scheduler to issue 32 instructions (threads) per clock. Does it do that always, on every clock cycle? No, it does not, and I don’t think this is in any way contradictory to the whitepaper documents, which generally describe capability.

So if I say that 16 threads are issued in the first clock, and 16 threads are issued in the second clock, you are wondering how the information for the 16 threads issued in the second clock is provided for? Like a hardware design description of the SM? I don’t have that, and I’m not sure why it would be necessary. I think it should be sufficient to say that the SM has the necessary capability to allow the warp scheduler to issue an instruction warp-wide, but in so doing it may issue the first 16 threads in the first clock, and the next 16 threads in the next clock. I don’t know if there is a memory, a queue, or some other hardware level structure that provides for this capability.

In any event, other than what I have written already, I think I will be unable to shed further light on how a warp scheduler issues an instruction to functional units.

Thanks, I was just confused by the warp-wide issue given that there isn’t a single functional unit capable of accepting 32 at a time, which made me think I might be missing something fundamental. But your explanation was great at addressing this.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.