It’s hard for me to find information about how multiple warps are executed concurrently on the same SM.
I have a Kepler device, so I’m particularly interested in that architecture.
My understanding of how this works is:
- The SMX has 4 warp schedulers which work cooperatively. They have several warps to choose from (perhaps each scheduler has its own group of warps, or perhaps they all use a single group). Each warp has its own instruction pointer. Different warps might be from the same kernel or from different kernels. I assume they negotiate with each other regarding the CUDA cores and other resources that they each need.
- Each scheduler will pick a warp that isn’t currently blocked. At a warp’s current position there can be up to two independent instructions destined for the CUDA cores, and/or instructions for other units. Each instruction is dispatched to 32 CUDA cores for the 32 threads (or the active subset) in the warp. With 8 instructions potentially dispatchable per cycle, at most 6 of them can go to the 192 CUDA cores (6 × 32 = 192), so possibly one of the warps has to wait until later.
- Whatever happens, it happens to all 32 threads in the warp simultaneously.
- Each of the warps has its own set of registers. I assume there’s some internal per-warp mapping between register numbers, thread IDs, and actual entries in the common register file. Allocation of all the needed registers in the register file can limit the number of warps that can be actively scheduled (see the sketch after this list).
- I assume that the scheduler sticks to the same one or two warps to dispatch as long as they aren’t blocked. Another possible policy would be some sort of round-robin selection.
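On the register-file point above: one concrete way to watch that constraint is to have the compiler report per-thread register usage, and optionally cap it with __launch_bounds__. This is a minimal sketch of my own, not from any sample; the kernel body is just a placeholder:

```cuda
// Compile with: nvcc -arch=sm_35 --ptxas-options=-v regcap.cu
// ptxas then prints the registers-per-thread count for each kernel.
//
// A Kepler SMX has 65536 32-bit registers, so 2048 resident threads
// leave at most 65536 / 2048 = 32 registers per thread.
// __launch_bounds__(256, 8) asks the compiler to stay within that
// budget so 8 blocks of 256 threads can be resident at once.
__global__ void __launch_bounds__(256, 8)
capped(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;  // placeholder work
}
```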
For an actual example, I ran the ‘clock’ sample program. Each block does some computing and puts a start and an end clock() time in a return array, indexed by the block index (only thread 0 writes them).
There are 128 blocks of 256 threads each. From the returned clock values, it’s evident that 8 blocks at a time run more or less overlapped, and each group of 8 blocks starts a little before the previous group ends.
If I change the code to specify 160 threads per block, it runs 12 blocks at a time.
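For reference, the measurement pattern boils down to something like this (my simplified sketch of what the sample does, not the sample verbatim):

```cuda
#include <cuda_runtime.h>

// Thread 0 of each block stamps a start and an end time, indexed by
// the block index, so timer[] holds gridDim.x start values followed
// by gridDim.x end values. Note that clock() reads a per-SMX counter,
// so stamps are only directly comparable between blocks that ran on
// the same SMX.
__global__ void timedKernel(clock_t *timer)
{
    const int bid = blockIdx.x;
    if (threadIdx.x == 0) timer[bid] = clock();

    // ... the block's real work goes here ...
    __syncthreads();

    if (threadIdx.x == 0) timer[bid + gridDim.x] = clock();
}

// Launched as: timedKernel<<<128, 256>>>(dTimer);
```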
Evidently, the SMX can hold up to 2048 resident threads, and all of a block’s threads become resident together. That accounts for the counts above: 2048 / 256 = 8 blocks, and floor(2048 / 160) = 12 blocks at a time.
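If the toolkit is 6.5 or newer, the runtime can also report this limit directly instead of my inferring it from clock values. A minimal sketch, reusing the timedKernel above (same file):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int blocksPerSM = 0;
    // How many blocks of a given size can be resident on one SMX for
    // this kernel, given its register and shared-memory usage? If
    // only the 2048-thread limit applies, expect 8 and 12 here.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, timedKernel, 256, 0);
    printf("blocks per SMX at 256 threads: %d\n", blocksPerSM);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, timedKernel, 160, 0);
    printf("blocks per SMX at 160 threads: %d\n", blocksPerSM);
    return 0;
}
```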
One thing that puzzles me is why it doesn’t use more SMX units, since the device has 15 of them. Of course, I wouldn’t want it to use all 15, since it’s also my system display device. But I would like to see if I can run the same kernel on 2, or even 10, SMXs together.
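One way to check this for myself would be to read the %smid special register (documented in the PTX ISA) from each block; as far as I know a block stays on the SMX it started on, so one sample per block should suffice. A sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// %smid is the ID of the multiprocessor the calling thread runs on,
// per the PTX ISA.
__device__ unsigned int smid()
{
    unsigned int sm;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(sm));
    return sm;
}

// Thread 0 of each block records which SMX the block ran on.
__global__ void whereAmI(unsigned int *sms)
{
    if (threadIdx.x == 0) sms[blockIdx.x] = smid();
}

int main()
{
    const int blocks = 128;
    unsigned int *dSms = 0;
    unsigned int hSms[blocks];
    cudaMalloc(&dSms, blocks * sizeof(unsigned int));
    whereAmI<<<blocks, 256>>>(dSms);
    cudaMemcpy(hSms, dSms, sizeof(hSms), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %3d ran on SMX %u\n", b, hSms[b]);
    cudaFree(dSms);
    return 0;
}
```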
I’ve just been searching for anything about multiple SMs for one kernel, and all I found were ways to run more than one kernel on one SM, and ways to force a kernel to run on the same SM all the time. Nothing on what I was looking for.
So my question for you all is: can I run a single kernel invocation on multiple SMs concurrently, and how? A related question: does use of shared memory in the kernel require it to run on a single SM?