Thread block/work-group scheduling on Nvidia GPUs?

Hi all,
I’m fairly new to OpenCL/CUDA and GPGPU programming and would like to clarify something:
Do work-groups/thread blocks interleave, the way warps within a work-group/block do, on an SM of Nvidia GPUs?
Or are they always serialized, meaning one work-group/block has to retire before the next one comes in?



As long as the SM has enough free resources (registers and shared memory), multiple blocks will indeed run simultaneously and interleaved. This can even include blocks from other independent streams’ kernels.

On Kepler, as many as 16 different blocks can run on an SM simultaneously.
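To make the “independent streams” point concrete, here is a minimal sketch (kernel names and sizes are hypothetical, not from the original post) of two unrelated kernels launched into separate streams; if registers and shared memory permit, blocks from both can be resident on the same SM at the same time:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels for illustration only.
__global__ void kernelA(float *x) { x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

int main() {
    float *x, *y;
    cudaMalloc(&x, 1024 * sizeof(float));
    cudaMalloc(&y, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launched in different streams, so the hardware is free to keep
    // blocks from both kernels resident on the same SMs concurrently,
    // provided the register/shared-memory budget allows it.
    kernelA<<<8, 128, 0, s1>>>(x);
    kernelB<<<8, 128, 0, s2>>>(y);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whether the blocks actually overlap depends on the device and on per-block resource usage; this only shows the launch pattern that makes overlap possible.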

Hi SPWorley,

If this is the case, is there still any performance benefit to a large work-group size (a multiple of the warp size, for instance)? My understanding is that large work-groups enable the warps in a work-group to interleave when one stalls on memory latency. If work-groups can also interleave, just like the warps within a work-group do, then memory latency can be hidden in the same fashion. Am I misunderstanding anything?

And one additional question: I found that for some kernels, the best work-group size is actually smaller than the Nvidia warp size of 32 (like 4). This is really confusing to me. Does that mean my kernels are using too many shared resources, so they work better with only a small number of threads running in parallel?


It would be extremely unusual to find a kernel where the best block size is less than 32. Per-thread resources (like registers) are generally allocated for the entire warp, even if you don’t use all the threads in the warp.

What does this kernel do?

Hi Seibert, thanks for your response. (I actually have two questions; it would be great if you could help with the other one as well.)

This kernel functions like a processor: it fetches an instruction, loads some data based on fields in the instruction, performs some computation according to the instruction opcode, and writes the data back. Here is some pseudocode:

inst currInst = inst_mem[pc];                 // read from global memory
data currData = data_mem[currInst.dataAddr];  // load based on instruction fields

if (currInst.cmd == /* some opcode */) {
    // modify values in currData
} else if (currInst.cmd == GATHER) {
    // does something else
}

data_mem[currInst.dataAddr] = currData;       // write the data back



There should be a lot of memory stalls when accessing data memory (which is global to all work-groups), since the accesses are not coalesced. I was expecting a large work-group/block size to help performance, but it wasn’t the case. There are 1K threads in total, and I am running on a GTX 680M, which has 7 SMs, each with 192 cores/stream processors, which, I assumed, allows 6 warps to run simultaneously.

Any guess why the best block/work-group size is only 8 in this case? I am really confused.

And does anyone know the difference between having 256 blocks of 128 threads (4 warps) and 1024 blocks of just 32 threads (1 warp), assuming 32K threads in total? I thought blocks could not interleave, so it made sense to have really large thread blocks whose warps could interleave to hide stalls due to data dependencies. But SPWorley pointed out that blocks can also interleave, so what is the point of having blocks of multiple warps?

Any help is greatly appreciated


The memory address divergence and the instruction execution flow divergence are likely to be very high in the above sample.

Compute capability 2.* devices can support 8 blocks per SM.
Compute capability 3.* devices can support 16 blocks per SM.

The maximum number of threads per block, maximum number of threads per SM, and maximum number of registers per thread also vary.

Once a block is assigned to an SM, there is little differentiation in terms of scheduling between warps from different blocks. There are, however, some behaviors that can be attributed to 1-2 warp blocks versus 32 warp blocks:

  • If a kernel heavily uses barriers (syncthreads), then it is recommended to have 2-4 blocks per SM to ensure that there are sufficient eligible warps per cycle.
  • Kernels that address memory with block locality (all accesses in a block are consecutive) tend to favor larger block sizes, as this reduces the working set in the L1 cache and local MMU.
  • If a kernel has a large icache footprint and each thread follows the same execution path, then larger blocks generally have better icache behavior, as the warps’ fetch addresses tend to stay within a few cache lines. Lots of smaller blocks tend to have higher contention on the icache, leading to fetch stalls.
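The barrier point in the first bullet can be sketched as follows (a hypothetical shared-memory tiling kernel, not from the thread): when all warps of one block are waiting at the barrier, the scheduler can only issue warps from *other* resident blocks, which is why having 2-4 blocks per SM helps.

```cuda
#include <cuda_runtime.h>

// Hedged sketch of a barrier-heavy pattern. Assumes blockDim.x == 128.
// While every warp of this block waits at __syncthreads(), the SM's
// warp schedulers can still issue eligible warps from other blocks
// resident on the same SM -- if any are resident.
__global__ void tiled(float *out, const float *in) {
    __shared__ float tile[128];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];
    __syncthreads();  // all warps in this block stall here until the last arrives

    // Neighbor read that is only safe after the barrier.
    out[i] = tile[threadIdx.x] + tile[threadIdx.x ^ 1];
}
```

With only one such block per SM, every barrier leaves the SM with nothing to issue; with several resident blocks, another block’s warps can fill the gap.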

In the end, the critical item is to use the register file and shared memory wisely while maintaining sufficient eligible warps so that the warp scheduler can issue every cycle. If you launch 1-warp blocks, then occupancy is limited to 16% (CC 2.0) and 25% (CC 3.*) respectively, which is insufficient to hide most latencies.
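If your toolkit has the occupancy API (CUDA 6.5 and later), you can check how block size limits resident warps for a specific kernel instead of estimating by hand. A hedged sketch (the `dummy` kernel is a stand-in for your own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; substitute your real kernel to get its actual limits.
__global__ void dummy(float *x) { x[threadIdx.x] = 0.0f; }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / 32;

    int sizes[] = {32, 128, 256};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        // Asks the runtime how many blocks of this size fit on one SM,
        // given the kernel's register and shared-memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy,
                                                      blockSize, 0);
        int warpsPerSM = blocksPerSM * blockSize / 32;
        printf("block=%3d: %2d blocks/SM, theoretical occupancy %.0f%%\n",
               blockSize, blocksPerSM,
               100.0 * warpsPerSM / maxWarpsPerSM);
    }
    return 0;
}
```

For 32-thread blocks, the blocks-per-SM cap (8 on CC 2.*, 16 on CC 3.*) binds first, which is exactly where the 16%/25% occupancy ceilings above come from.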

In order to understand the GPU architecture and code execution, I recommend you use Nsight VSE (or cuda-gdb) and single-step your kernel from the perspective of one warp so you can see the effects of execution flow divergence. The Nsight VSE CUDA profiler, Visual Profiler, and nvprof all provide metrics that show both execution efficiency (the number of threads per warp that are active per instruction executed) and memory address divergence.