Clarifying the process of issuing instructions on CUDA devices

Hello,

I would like to make sure I have a proper understanding of how instructions are issued. In the A100 whitepaper, each sub-partition has a warp scheduler which is said to issue 32 instructions at once (I assume from the same warp). Each instruction ends up in the appropriate pipeline. I have a couple of specific questions:

  1. No instruction type seems to have enough slots to receive 32 instructions at once (FP32 has 16 and FP64 has 8). Does that mean they are “queued” in the pipeline, so that on the next cycle the scheduler can process another ready instruction in another pipeline? Or will the scheduler issue a half/quarter warp in one cycle and the rest in the next? If it’s the latter, then I’m confused about the 32 thread/clk figure claimed by the whitepaper.

  2. I’m trying to get an idea of the theoretical IPC I should be able to achieve, to guide my profiling. If I understand correctly, the A100 SM has a throughput of 64 inst/clk for INT32. Even without taking advantage of the other pipelines, I should get an IPC of 64. However, for one kernel I profile I get 3 inst/cycle, which sounds very low. Yet at the same time, Nsight Compute reports 80% utilization of the INT32 pipeline and high compute throughput. How can I reconcile these two contradictory pieces of information?

Thanks for the help

  1. This is a question that gets asked from time to time (examples: 1, 2). A whitepaper will often describe the SM as a whole. So if the SM has 64 functional units that handle FP32, for example, and is broken into 4 SMSPs, each SMSP will have a throughput of 16 threads/clk, whereas the SM as a whole will have a throughput of 64 threads/clk.

  2. I commonly state that “in CUDA, all instructions are issued warp-wide.” Therefore I suspect that the confusion here arises because you are counting a warp issue of an instruction as 32, whereas the tools and other places may be counting it as 1. Example:

  • Instructions Executed — Number of times the source (instruction) was executed per individual warp, independent of the number of participating threads within each warp.
  • Thread Instructions Executed — Number of times the source (instruction) was executed by any thread, regardless of predicate presence or evaluation.

Unit 3 of this online training series may also be of interest.

Thank you very much, point 2 is now clear to me.

However, for point 1, I don’t think things add up yet. In https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, page 22, we see the diagram for an SM. There are clearly 4 sub-partitions, and each partition has a warp scheduler that can issue 32 inst/clk. But associated with this scheduler, there are only 16 INT32 pipelines. Where do the other 16 instructions go? Do they wait in some queue? Or does the full warp go into a single pipeline, with each pipeline simply having a very high latency?

Thanks for the great support!

As I mentioned in the two linked examples I provided, if the warp scheduler needs to issue an instruction to a set of functional units whose number is less than 32, the issue will take place over multiple clock cycles.

Not all functional units are fully enumerated in pictorial diagrams, nor is a complete description given in any whitepaper that I know of. But AFAIK, for example, the LD/ST unit can accept up to a full warp per clock. Therefore, for some instructions, it is possible for the warp scheduler to issue 32 instructions (threads) per clock. Does it do that always, on every clock cycle? No, it does not, and I don’t think this is in any way contradictory to the whitepaper documents, which generally describe capability.

So if I say that 16 threads are issued in the first clock, and 16 threads are issued in the second clock, you are wondering how the information for the 16 threads issued in the second clock is provided for? Like a hardware design description of the SM? I don’t have that, and I’m not sure why it would be necessary. I think it should be sufficient to say that the SM has the necessary capability to allow the warp scheduler to issue an instruction warp-wide, but in so doing it may issue the first 16 threads in the first clock, and the next 16 threads in the next clock. I don’t know if there is a memory, a queue, or some other hardware level structure that provides for this capability.

In any event, other than what I have written already, I think I will be unable to shed further light on how a warp scheduler issues an instruction to functional units.

Thanks, I was just confused by the warp-wide issue given that there isn’t a single functional unit capable of accepting 32 at a time, which made me think I might be missing something fundamental. But your explanation was great at addressing this.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.