A basic question about block scheduling

Hi all,

When all the blocks currently in SM are blocked by some reason (for example, waiting for some data), does the scheduler do the rescheduling to replace idle blocks out of SM and let new blocks in to run?


No, once a block has been launched on an SM, it stays on that SM and consumes a slot, until it is retired (i.e. the threadblock completes and terminates execution). If a SM has a full complement of threadblocks scheduled on it, one or more of those threadblocks must complete before any new threadblocks can be scheduled, on that SM.

Thank you! So in other word, if all blocks current in SMs waiting for result from the blocks outside SM, then it cause deadlock. Right?


Yes, in general such behavior can lead to deadlock. Normally deadlock will not occur if the blocks are simply waiting on memory fetches to be returned. But if you attempt to implement a naive global synchronization, for example, this threadblock scheduling behavior can contribute to the deadlock.

It’s good CUDA programming practice to be sure that your threads can execute in any order, and still produce a correct result.

For example if I write a CUDA kernel that requires all threadblocks to reach a certain point, before any threadblock can continue beyond that point (perhaps via a global semaphore), and I launch enough threadblocks, that is a good recipe for disaster.

Thank you very much!

What does this mean? "Normally deadlock will not occur if the blocks are simply waiting on memory fetches to be returned. "

Would you please give an example? And “Normally” means that there is no guarantee, correct?


Suppose my SM can have a maximum of 8 threadblocks open, and has 8 threadblocks open. And suppose each open threadblock is waiting on a memory fetch:

int a = global_mem[idx];

Even though all 8 threadblocks may be termporarily stalled, this will not result in a deadlock because eventually, the memory requests will be satisfied and the stalls will be cleared.

If, on the other hand, we have a conditional loop:

while ((int a=global_mem[idx]) != 32){
// wait

Then if the global memory location is set to 32 by another threadblock and that threadblock cannot start on any SM because all SMs have a full complement of threadblocks, then a deadlock may occur.

By “normally” I simply meant a memory fetch of the first type above, with no conditional behavior.
For a memory fetch of the second type, where there is some conditional behavior, deadlock is possible, although the memory stall is not the cause of the deadlock.

I think this means that if a thread block is running and loading data from memory it won’t block. This is sometimes called discovered concurrency, i.e. once a thread block starts running it continues running.

There is a guarantee that once a thread block from a kernel starts running, it will continue running unless it explicitly blocks (e.g. cudaDeviceSynchronize).

In general though, thread blocks do not have a guarantee of concurrency. This means that if one thread block is running, it cannot assume that any other thread block is currently running.

It is possible for a program to deadlock if threads in one block wait (i.e. with a spin loop) for a result to be produced by threads in another thread block.

Here is a simple example:

I have a job that have N sub tasks, task 0 can start with no data dependency. task 1 depend on result of task 0, task 2 depends on task 1, …, task N-1 depend on task N-2. (For some reason, these sub tasks cannot be executed in serial)

Now I have N CUDA blocks, I use global variable to observe the schedule sequence (scheduled to be in SMs), then I use the first arrived block to do task 0, the second arrived block to do task 1, …, the last arrived block to do task N-1.

Then no block scheduled earlier will wait for the output from a block that is scheduled later. Can I avoid deadlock though this way? Is there anything I missed that could cause potential deadlock?

Any input would be appreciated.


No, this example should work correctly. It should be tolerant to blocks being scheduled out of order, and all of the dependencies are forward (i.e. there are no cycles).

You just need to make sure that you use the appropriate memory operations to communicate results between blocks through memory correctly.

Thank you very much, Gregory.


Thank you very much, txbob. I have one more question for:

while ((int a=global_mem[idx]) != 32){
// wait

Suppose all the thread blocks are in SMs now. block 1 use this loop to wait for block 0. and block 0 cannot continue without finishing writing this data. Is it possible that:

The block 1 keeps checking this data and block 0 has no chance to change it?


As long as block 0 and block 1 are both active (i.e. scheduled on SMs), it doesn’t matter how much checking block 1 is doing, block 0 will still get access to the location, ie. if block 0 does a write, it will eventually show up there. To make this work properly, it’s usually beneficial to use some combination of volatile, __threadfence(), and/or atomics. You may be interested in the threadfence Reduction cuda sample code:


While it does not do exactly what you are describing, there are similarities, and if you study that code it will be instructive on how to make threadblocks coordinate activity via a global memory semaphore.

Got it. Thank you very much, txbob. It is very helpful.


This is true most of the time, but it is not quite true if your application is using dynamic parallelism. If you use dynamic parallelism, it is possible for a thread block that was running on an SM to become suspended, and then not be resumed until some other blocks finish.

It may help to reason about this case if you assume that the work distribution logic is a FIFO queue (it is actually more complex than this, but this should illustrate the point). A thread block may make it to the front of the queue and start executing. However, if the application uses CDP, it may be suspended, and then put back into the queue. At the point the GPU may be full executing other thread blocks, and the suspended block may not be able to continue executing until a slot is freed up.

There are also several more complex issues that may vary between hardware and driver revisions.

We generally say that thread blocks do not have a guarantee of concurrency because that is the simplest true statement.