I have an A10 card, and threads per block cannot be more than 1024, as I found in the queried cudaDeviceProp. But there seems to be no limit on the number of thread blocks; I tried more than 10,000 blocks and got no error.
So the maximum number of blocks, if such a limit does exist, must be a very big number.
The programming guide says all blocks must execute independently, and must be able to be executed serially. Does that mean that, leaving aside the thread block cluster feature (which is not available on my platform), a dependency between thread blocks can possibly cause a deadlock?
There are hardware maximum limits for threadblocks (i.e. grid dimensions), which are published in the programming guide and visible in the deviceQuery sample app (you can get them from the cudaDeviceProp query as well).
If you use a 1D grid, you can launch up to 2^31 - 1 threadblocks in the X-dimension.
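For reference, a minimal sketch of querying those limits from cudaDeviceProp (device 0 assumed):

#include <cstdio>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
}

On current GPUs this reports 1024 threads per block and a grid limit of 2147483647 x 65535 x 65535.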
Yes, if you create some sort of dependency between threadblocks, and don't pay any attention to other factors, you could definitely create a deadlock.
As I understand it, if there are very many thread blocks, and a consumer block is scheduled while its producer block is not, this will be a deadlock? If so, will a grid-level barrier solve the problem? For example, the consumer block calls grid.sync() before entering the busy-wait loop? And if a barrier is the way to solve this: before the cooperative groups feature was introduced, there was no device- or grid-level barrier, if I remember correctly. Could thread-block dependencies be used without causing deadlock back then?
Possibly, possibly not. It depends on the actual kernel design. Such questions can't be given absolute answers when there is no code to study.
grid.sync() is illegal in a kernel that is not launched properly, i.e. one that has more blocks than can be simultaneously scheduled on the GPU you are running on (there are other requirements to meet as well). Therefore the question is not sensible if there are "many blocks"; it is illegal to do that. If the kernel is launched properly, there is no particular reason (in the absence of any code) to assume that using it will somehow lead to deadlock.
I wrote two simple kernel functions as a test:
__global__ void dependent_threads(int *i){
    if(threadIdx.x == 0){
        atomicAdd(i, 1);
    }
    if(threadIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
__global__ void dependent_blocks(int *i){
    if(threadIdx.x == 0){
        atomicAdd(i, 1);
    }
    if(threadIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
And the host functions launch them as 1024 threads/block with 1 block, and 1 thread/block with 1024 blocks, respectively:
int main(){
    int *ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    dependent_threads<<<1,1024>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
int main(){
    int *ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    dependent_blocks<<<1,1024>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
And the result is that the kernel dependent_threads works normally, but the kernel dependent_blocks finishes with no output. I didn't see any error code returned, but also no results. It does not seem like a deadlock; or does the driver/compiler detect that the kernel is not a legal one and terminate it? But then why no error code?
That is not reflected in the code you have posted. In the code you have posted, both kernel launches are launching 1 block of 1024 threads. Furthermore, your posted kernels are not different in any way except for the kernel name.
Sorry for my mistake, the code I use for dependent_blocks should actually be:
__global__ void dependent_blocks(int *i){
    if(blockIdx.x == 0){
        atomicAdd(i, 1);
    }
    if(blockIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
int main(){
    int *ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    dependent_blocks<<<1024, 1>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
And the dependent_threads kernel should be:
__global__ void dependent_threads(int *i){
    if(threadIdx.x == 0){
        atomicAdd(i, 1);
    }
    if(threadIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
int main(){
    int *ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    dependent_threads<<<1,1024>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
So you don't ever initialize the atomic location? If so, that is undefined behavior. cudaMalloc, like malloc from the C standard library, does not initialize anything.
If I do proper initialization, I see the same behavior from both codes:
# cat t238a.cu
#include <cstdio>
#define CHECK_CU(x) x
__global__ void dependent_blocks(int *i){
    if(blockIdx.x == 0){
        atomicAdd(i,1);
    }
    if(blockIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
int main(){
    int * ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    cudaMemset(ptr, 0, sizeof(int)*16);
    dependent_blocks<<<1024, 1>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
# nvcc -o t238a t238a.cu
# compute-sanitizer ./t238a
========= COMPUTE-SANITIZER
Data received
========= ERROR SUMMARY: 0 errors
# cat t238b.cu
#include <cstdio>
#define CHECK_CU(x) x
__global__ void dependent_threads(int *i){
    if(threadIdx.x == 0){
        atomicAdd(i,1);
    }
    if(threadIdx.x == 1000){
        while(atomicExch(i, 0) == 0);
        printf("Data received\n");
    }
}
int main(){
    int * ptr;
    CHECK_CU(cudaMalloc(&ptr, sizeof(int)*16));
    cudaMemset(ptr, 0, sizeof(int)*16);
    dependent_threads<<<1,1024>>>(ptr);
    CHECK_CU(cudaDeviceSynchronize());
}
# nvcc -o t238b t238b.cu
# compute-sanitizer ./t238b
========= COMPUTE-SANITIZER
Data received
========= ERROR SUMMARY: 0 errors
#
For syncing grid-wide, you need to launch a cooperative kernel with cudaLaunchCooperativeKernel (see CUDA Runtime API :: CUDA Toolkit Documentation).
The documentation says
The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cudaOccupancyMaxActiveBlocksPerMultiprocessor (or cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute cudaDevAttrMultiProcessorCount.
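As a rough sketch (the kernel body and block size are illustrative, not from this thread), a cooperative launch that respects that limit and uses grid.sync() could look like this, on a device that supports cooperative launch and compiled with -rdc=true:

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void coop_kernel(int *d){
    cg::grid_group grid = cg::this_grid();
    if (threadIdx.x == 0) atomicAdd(d, 1);
    grid.sync();                               // grid-wide barrier; all blocks are co-resident
    if (blockIdx.x == 0 && threadIdx.x == 0) printf("blocks counted: %d\n", *d);
}

int main(){
    int numSMs = 0, blocksPerSM = 0;
    const int threads = 128;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, coop_kernel, threads, 0);
    int maxBlocks = blocksPerSM * numSMs;      // largest grid that is guaranteed co-resident

    int *d;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));
    void *args[] = { &d };
    cudaLaunchCooperativeKernel((void*)coop_kernel, dim3(maxBlocks), dim3(threads), args, 0, 0);
    cudaDeviceSynchronize();
}

Launching more than maxBlocks blocks this way returns an error instead of silently deadlocking.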
If you want to do the producer-consumer pattern, why not - for each block - let the first half of the threads produce, and the second half consume?
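For concreteness, a minimal sketch of that intra-block split (names and sizes are purely illustrative):

#include <cstdio>

__global__ void intra_block_pc(int *out){
    __shared__ int box;                              // per-block "mailbox"
    const int half = blockDim.x / 2;
    if (threadIdx.x == 0) box = blockIdx.x + 1;      // producer half (represented by thread 0)
    __syncthreads();                                 // block-wide barrier: produce before consume
    if (threadIdx.x == half) out[blockIdx.x] = box;  // consumer half (represented by thread 'half')
}

int main(){
    int *out;
    cudaMalloc(&out, sizeof(int) * 4);
    intra_block_pc<<<4, 256>>>(out);
    cudaDeviceSynchronize();
    int h[4];
    cudaMemcpy(h, out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", h[0], h[1], h[2], h[3]);   // expected: 1 2 3 4
}

Because __syncthreads() is a real block-wide barrier, this ordering cannot deadlock, unlike a busy-wait between blocks.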
If you want to do the producer-consumer pattern, why not - for each block - let the first half of the threads produce, and the second half consume?
Yes, that is a natural solution. But actually I am not looking for the best parallel solution to a particular problem; I am just trying to figure out how thread blocks work in CUDA.
The first chapter of the programming guide gives me the impression that the difference between intra-block and inter-block is that each block should be independent: thread blocks are expected to work without relying on the results of other blocks.
But when discussing this with some developers (I am not sure if they are experts on CUDA), I was told it is possible to cooperate between thread blocks. My first thought was that this requires a wider barrier, and they said we can use atomic operations to implement the barrier. Then I started worrying about deadlock, because I think inter-block dependency is not guaranteed to work that way if there are very many blocks.
I tried the code discussed above, and as Robert said, with correct initialization the two kernels give the same result. I don't know if there is a kernel that better demonstrates the deadlock problem caused by block dependency, but in this specific case deadlock didn't happen, and I am still confused.
Maybe I should put my question this way: what happens if the number of thread blocks of a kernel exceeds the maximum resident blocks per SM multiplied by the total number of SMs? Will the waiting blocks keep waiting until the earlier blocks exit, or can they be scheduled earlier? My understanding of the programming model is that the blocks may have to wait until the earlier blocks finish, and that is why programmers are asked to write kernels in a way that blocks can work independently.
Is the model in the programming guide only a defensive, conservative description, or did I totally misunderstand it? Or, since CUDA has a GPU-side stack, is context switching of blocks handled transparently, so that any thread dependency that works within a block will also work between blocks?
Blocks not yet scheduled will wait until there is space available on an SM to schedule them. While CUDA does have a preemption mechanism, it's not something programmers can rely on for general use.
So a block waiting on a block that is not yet scheduled (and cannot be, because the GPU is full and waiting) is a possible deadlock mechanism.
Here is an example:
# cat t240.cu
#include <iostream>
__global__ void k(volatile int *d, const int sz){
    atomicAdd((int *)d, 1);
    while (*d < sz){};
}
int main(int argc, char *argv[]){
    int nb = 10000;
    int *d;
    cudaMallocManaged(&d, sizeof(d[0]));
    *d = 0;
    if (argc > 1) nb = atoi(argv[1]);
    k<<<nb,1>>>(d, nb);
    cudaDeviceSynchronize();
    std::cout << *d << std::endl;
}
# nvcc -o t240 t240.cu
# ./t240 1392
1392
# ./t240 1393
^C
#
My L4 GPU happens to have 58 SMs, cc8.9. Each cc8.9 SM can hold a maximum of 24 threadblocks. 24 x 58 = 1392.
So as soon as I exceed this number, there will be at least 1 threadblock that is waiting in the queue and never gets scheduled, because the GPU is full of threadblocks waiting for the count to reach a certain level. It will never get to that level because some threadblocks can't get scheduled. So you have deadlock.
Yes, that is what they will do. They cannot be scheduled "early" if there is no room to do so, and the GPU will not use preemption in a case like this.
If you want to avoid such a deadlock, then you have to guarantee that all (necessary) threadblocks will be scheduled, i.e. deposited on an SM.
As one example, it is for this reason that the grid sync mechanism in cooperative groups has specific limits on grid sizing:
To guarantee co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered.
(emphasis added)
If you donât have inter-block dependencies, then most of the above discussion can be discarded, and kernels can be generally designed to use as many blocks as is desired, up to relevant hardware limits.
It looks to me like your A10 GPU has 72 SMs. These are Ampere (cc8.6) SMs, each of which can hold 16 threadblocks maximum. 16x72 = 1152
So I would surmise that if you ran the above test case like I did, when you use 1152 as a command-line argument, the code will run normally and print out 1152. If you try 1153 or a larger number, then the code will hang.
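If you want to check that arithmetic programmatically rather than multiplying by hand, the occupancy API can report it for this particular kernel. A sketch (it repeats the kernel from t240.cu above so it compiles on its own):

#include <cstdio>

// same kernel body as in t240.cu above, repeated so this sketch is self-contained
__global__ void k(volatile int *d, const int sz){
    atomicAdd((int *)d, 1);
    while (*d < sz){};
}

int main(){
    int numSMs = 0, blocksPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, k, 1 /*threads per block*/, 0);
    printf("co-resident block limit for k<<<N,1>>>: %d SMs x %d blocks/SM = %d\n",
           numSMs, blocksPerSM, numSMs * blocksPerSM);
}

On the L4 above this should print 58 x 24 = 1392, and presumably 72 x 16 = 1152 on the A10.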
Blocks with no interdependency
This is the typical case for 99% (figure of speech) of kernels. In CUDA you try to solve tasks that can be parallelized, and the separation of a task into independent subtasks is communicated to the CUDA system by the separation into blocks.
This also helps the same kernel run on different CUDA architectures and, within an architecture, on GPUs of different power.
Blocks which share resources, but the order is irrelevant
The next step is blocks which share resources, but where the order of execution is irrelevant. The best case (from a synchronization point of view) would be if the blocks were executed serially.
That is what atomic operations are for: from primitives (e.g. atomicAdd) to locking structures in global memory during accesses.
It is also not enough to have memory-access ordering guarantees; you also need guarantees that blocks will see the results of other blocks at some point. That is not trivial for more complex architectures, e.g. GPUs which are actually 2 or 3 GPUs inside, with separate L2 caches.
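As a small, hypothetical illustration of that order-irrelevant style: every block contributes with an atomic, and only the last block to arrive (detected with a second atomic counter plus __threadfence()) reads the combined result, so no block ever waits on another.

#include <cstdio>

__device__ int total = 0;      // shared result; the order of contributions does not matter
__device__ int arrived = 0;    // how many blocks have finished contributing

__global__ void order_irrelevant(int nblocks){
    if (threadIdx.x == 0){
        atomicAdd(&total, (int)blockIdx.x);          // this block's contribution
        __threadfence();                             // make it visible device-wide first
        if (atomicAdd(&arrived, 1) == nblocks - 1)   // the last block to arrive reports
            printf("total = %d\n", total);
    }
}

int main(){
    const int nblocks = 1024;
    order_irrelevant<<<nblocks, 32>>>(nblocks);
    cudaDeviceSynchronize();     // expected output: total = 523776 (0 + 1 + ... + 1023)
}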
Implicitly co-resident blocks
Some "hackers" found out that with the synchronization instructions from the previous section one can build interdependent blocks, like producer-consumer models, and generally keep blocks resident, let them grab their own work packages, and specialize each block for specific tasks.
That was somewhat hacky, as there was no guarantee from Nvidia that the blocks would really be loaded concurrently.
And if the GPU was also used for other tasks, e.g. the GUI, it could be that there were not enough free resources for quite some time.
One could safeguard it a bit more by testing (e.g. by atomically counting) how many blocks were resident.
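A rough sketch of that persistent-block idea (purely illustrative; the blocks here never wait on each other, so it works even if not all launched blocks happen to be resident): each block loops, grabbing work items from a global counter until the queue is drained.

#include <cstdio>

__device__ int next_item = 0;    // global work counter shared by all blocks

__global__ void persistent_blocks(float *data, int nitems){
    while (true){
        __shared__ int item;
        if (threadIdx.x == 0)
            item = atomicAdd(&next_item, 1);            // the block grabs its own work package
        __syncthreads();
        if (item >= nitems) break;                      // queue drained: the whole block retires
        data[item * blockDim.x + threadIdx.x] *= 2.0f;  // stand-in for real per-item work
        __syncthreads();                                // keep 'item' stable until all threads used it
    }
}

int main(){
    const int nitems = 1000, threads = 256;
    float *data;
    cudaMalloc(&data, sizeof(float) * nitems * threads);
    cudaMemset(data, 0, sizeof(float) * nitems * threads);
    persistent_blocks<<<64, threads>>>(data, nitems);   // far fewer blocks than items; they loop
    cudaDeviceSynchronize();
}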
Cooperative kernels
Nvidia saw the usefulness (for optimization and/or clear kernel design) of those approaches and sanctioned them with cooperative kernels. Now you get guarantees (or errors, if it does not work) on the host side at kernel launch time, and more powerful synchronization methods between blocks.
In cooperative kernels blocks are always co-resident.
Thread Block Clusters
This is kind of a compromise between cooperative kernels and independent blocks with no interdependency.
On the high level, there are still independent clusters. On the mid level, blocks within a cluster are co-resident and cooperate; there are even new ways to sync or exchange shared memory contents.
It was always possible to do this within a block by assigning each warp a different task and synchronizing between them (even compute capability 1.0 allowed 16 warps per block), but now it can be done with more resources available (several SMs), or for kernels where it is not possible to parallelize into enough independent tasks. Each CUDA generation has more SMs (smaller structure sizes), so the needed number of independent blocks has increased over the years. Thread block clusters make this easier, as fewer independent subtasks are needed.
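For completeness, a sketch of the cluster syntax (requires compute capability 9.0 and CUDA 12, so it will not run on the A10 discussed here; names are illustrative, compile with -arch=sm_90):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks per cluster; the clusters themselves remain independent of each other.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int *out){
    cg::cluster_group cluster = cg::this_cluster();
    // ... blocks of the same cluster could exchange data here (distributed shared memory) ...
    cluster.sync();                            // barrier across the blocks of this cluster only
    if (cluster.block_rank() == 0 && threadIdx.x == 0)
        out[blockIdx.x / 2] = 1;               // one result per cluster
}

int main(){
    int *out;
    cudaMalloc(&out, sizeof(int) * 8);
    cluster_kernel<<<16, 128>>>(out);          // 16 blocks => 8 clusters of 2 blocks each
    cudaDeviceSynchronize();
}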
That is a very clear and convincing demonstration. I duplicated the result on several machines, and all the results square with your explanation.
But now I have a further question. Is the upper bound for such a kernel with block dependency a fixed value, or does the upper bound vary with the overall workload on the device? Take the A10 card as an example: the maximum number of resident blocks is 1152, and the demo kernel with 1153 blocks gets stuck in deadlock, so 1152 is the upper bound in that case. If there are already some other tasks running on the device, wouldn't the upper bound be even smaller than 1152?
I thought the upper bound would be changeable, and I tried the following:
- Launch a long task to occupy the device
- At the same time, run the demo kernel above with different numbers of blocks
But to my surprise, the limit is still 1152. The first task should have occupied at least some SMs, yet the limit does not decrease even by 1. Then I did another test:
- Launch the demo kernel with 1153 blocks; it gets stuck.
- nvidia-smi shows the GPU utilization is 100%.
- At the same time, run another process of the demo kernel above with 1152 blocks.
And the second process with 1152 blocks still completes smoothly as before, as if the first kernel with 1153 blocks occupying the GPU at 100% utilization doesn't make any difference. Does each kernel have a complete view of the hardware resources, and is that why the limit is always the same (for the A10, always 1152)?
Details probably matter. If that other task eventually finishes and exits (thus "freeing up" the device), then I would expect that you could still eventually get 1152 blocks of the test code in this thread to be resident. But if the other task "never" exits, then it is going to be using up some of the device, and this kernel, needing 1152 blocks, would never get full access to the device, and so it would hang. Or at least it would not finish until the other task finishes. But here we have to be careful with the definition of "task". If "task" is e.g. another thread in the same application, or just the same application itself, then the above comments apply. But if "task" is another process, then see below.
The GPU inter-process scheduler, by inspection, uses preemption in some cases. Preemption allows an existing kernel to be "swapped out" when a new kernel comes along, so this is one possible explanation for the behavior. I would expect that if you launch two "long running" kernels from the same process (where at least one requires full occupancy), rather than from different processes, you will again see the different/original behavior, i.e. just a hang.