Beginner's question about concurrent warp execution.

It’s hard for me to find information about how multiple warps are executed concurrently on the same SM.
I have a Kepler device, so I’m particularly interested in that architecture.
My understanding of how this works is:

  1. The SMX has 4 warp schedulers which work cooperatively. They have several warps to choose from (perhaps each scheduler has its own group of warps, or perhaps they all use a single group). Each warp has its own instruction pointer. Different warps might be from the same kernel or from different kernels. I assume they negotiate with each other regarding the CUDA cores and other resources that they each need.
  2. Each scheduler will pick a warp that isn’t currently blocked. At the warp’s current instruction pointer there can be up to two independent instructions for the CUDA cores, and/or instructions for other units. Each instruction is dispatched to 32 CUDA cores, one per thread (or per active thread) in the warp.
    With up to 8 instructions dispatchable, at most 6 can go to the 192 CUDA cores, so one of the warps may have to wait until later.
  3. Whatever happens, it happens to all 32 threads in the warp simultaneously (see the sketch after this list).
  4. Each of the warps has its own set of registers. I assume there’s some internal mapping per warp between register numbers, thread IDs, and actual entries in the common register file. Allocating all the needed registers in the register file can limit the number of warps that can be actively scheduled.
  5. I assume that the scheduler sticks to the same one or two warps to dispatch as long as they aren’t blocked. Another possible policy would be some sort of round-robin selection.
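To make the warp/lane decomposition concrete, here’s a minimal sketch (my own illustration, not from any sample; warpReport and masks are made-up names) showing how a block splits into warps and one instruction executing across all 32 lanes at once:

```cuda
__global__ void warpReport(unsigned *masks)
{
    unsigned warp = threadIdx.x / warpSize;   // warp index within the block
    unsigned lane = threadIdx.x % warpSize;   // lane index within the warp
    // One ballot instruction is executed by all 32 lanes together; a full
    // warp therefore produces the mask 0xffffffff. (On pre-CUDA-9 toolkits
    // for Kepler, __ballot(1) is the equivalent call.)
    unsigned mask = __ballot_sync(0xffffffffu, 1);
    if (lane == 0)
        masks[blockIdx.x * (blockDim.x / warpSize) + warp] = mask;
}
```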

For an actual example, I ran the ‘clock’ sample program. Each block does some computing, and thread 0 writes start and end clock() times into a return array indexed by the block index.
There are 128 blocks of 256 threads each. From the returned clock values, it’s evident that 8 blocks at a time run more or less overlapped, and each group of 8 blocks starts a little before the previous group ends.
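For reference, the timing pattern in that sample is shaped roughly like this (a from-memory sketch with a placeholder kernel name, not the verbatim source):

```cuda
__global__ void timedWork(const float *input, float *output, clock_t *timer)
{
    if (threadIdx.x == 0)
        timer[blockIdx.x] = clock();              // start stamp, once per block
    // ... per-block computation on input/output ...
    __syncthreads();
    if (threadIdx.x == 0)
        timer[blockIdx.x + gridDim.x] = clock();  // end stamp, once per block
}
```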

If I change the code to specify 160 threads per block, it runs 12 blocks at a time.

Evidently, the SMX can run up to 2048 concurrent threads, and all of a block’s threads become resident on the SM at once. That accounts for running either 8 or 12 blocks at a time: 8 × 256 = 2048, and 12 × 160 = 1920 (a 13th block would exceed 2048).
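The runtime can confirm that arithmetic. A sketch, assuming a kernel whose occupancy is limited only by the 2048-thread cap (kernel is a placeholder name; the API is available since CUDA 6.5, if I recall correctly):

```cuda
#include <cstdio>

__global__ void kernel() { /* ... */ }

int main()
{
    int blocksPerSM = 0;
    // How many blocks of 256 threads fit on one SM (0 bytes dynamic shared memory)?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, 256, 0);
    printf("256 threads/block: %d blocks per SM\n", blocksPerSM);  // 8 if thread-limited
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, 160, 0);
    printf("160 threads/block: %d blocks per SM\n", blocksPerSM);  // 12 if thread-limited
    return 0;
}
```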

One thing that puzzles me is why it doesn’t use more SMX units, since the device has 15 of them. Of course, I wouldn’t want it to use all 15, since it’s also my system display device. But I would like to see if I can run the same kernel on 2, or even 10, SMXs together.

I’ve just been searching for anything about multiple SMs for one kernel, and all I found were ways to run more than one kernel on one SM, and ways to force a kernel to run on the same SM all the time. Nothing on what I was looking for.

So my question for you all is: can I run a single kernel invocation on multiple SMs concurrently, and how? Related question: does use of shared memory in the kernel require it to run on a single SM?

For a single kernel invocation, the block scheduler will generally distribute its blocks to multiple SMs. There is nothing you need to do to make this happen, and you have little control over block scheduling.

What won’t happen is for a (single) block to occupy more than one SM. The warps in a block (once the block is deposited by the block scheduler) are all resident on one and only one SM. Therefore, on a machine with 15 SMs, a kernel launch needs a minimum of 15 blocks to have even the possibility of using all the SMs in your GPU.
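One way to see the distribution for yourself is to launch at least one block per SM and have each block record which SM it landed on, via the %smid special register. A sketch (spread and smids are made-up names; the scheduler doesn’t guarantee one block per SM, but in practice it spreads them out):

```cuda
#include <cstdio>

__global__ void spread(unsigned *smids)
{
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));  // PTX special register: this SM's id
    if (threadIdx.x == 0)
        smids[blockIdx.x] = id;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int nBlocks = prop.multiProcessorCount;   // minimum block count to touch every SM
    unsigned *smids;
    cudaMallocManaged(&smids, nBlocks * sizeof(unsigned));
    spread<<<nBlocks, 256>>>(smids);
    cudaDeviceSynchronize();
    for (int i = 0; i < nBlocks; ++i)
        printf("block %d ran on SM %u\n", i, smids[i]);
    cudaFree(smids);
    return 0;
}
```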

There are many reasons why a block (once scheduled) must execute on one SM only; shared memory is one example. Shared memory is a logical per-block resource: the shared memory used by one block is distinct from that used by any other block. Its physical backing is a memory array on the SM itself, so for all threads/warps in a block to access the same logical shared memory space, that space must be hosted on a single physical array. It follows that all threads/warps associated with a block must run on the same SM.
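A tiny illustration of that per-block property (my own sketch): every thread reads a slot written by a neighboring thread of the same block, and no block can ever observe another block’s buffer:

```cuda
__global__ void perBlockShared(int *out)
{
    __shared__ int buf[256];                  // one physical copy per resident block
    buf[threadIdx.x] = blockIdx.x;            // each thread stores its block's id
    __syncthreads();
    // Read a slot written by a different thread of the SAME block; the value
    // is always this block's id, never another block's.
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
}
```

(Assumes a launch with 256 threads per block.)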

Silly me! I looked at %nsmid, expecting it to be 15, but it’s only 1!
I have a GT710 card, which I thought used a Kepler GK110 GPU, but apparently not. I also noticed that the GT710 spec says it has 192 cores.
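For anyone else who wants to check: %nsmid can be read with inline PTX like this (a sketch; note the PTX manual says SM ids need not be contiguous, so %nsmid can exceed the physical SM count):

```cuda
__device__ unsigned numSmIds()
{
    unsigned n;
    asm("mov.u32 %0, %%nsmid;" : "=r"(n));  // PTX special register %nsmid
    return n;
}
```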

Yes, that GPU has only 1 SM.
It is about as small as they come.