performance cost of too many blocks?

What happens if there are more blocks allocated in a kernel than there can be on the device at a given moment? ( ex. if my device has 20 SMs, and there are 2 blocks per SM, but my entire kernel has 4000 blocks )
In the example, there can only be 40 blocks at a time on the device. My question is:

  1. is there a context change among blocks while blocks are executed, or does a block run to completion before a waiting block is run?

  2. how does the device decide which block gets to be computed first?

  3. what is the cost of context change among blocks? In which hardware memory (L1, L2, device, local, register… ) does the context of a block reside?

  1. Leaving things like debugging, cuda dynamic parallelism, etc. out of the discussion, then once a threadblock becomes resident on a SM, it stays on that SM until it retires (i.e. until all threads in the block have reached their return or final code statement). This means, for example, that register allocation in the SM register file for a particular resident thread will remain allocated to that thread until that thread completes execution. I’m choosing my words carefully here, because it is possible with careful microbenchmarking, to prove that in some cases, block allocated resources (e.g. registers per thread) can be freed up on a granularity that is less than the whole block (e.g. warp level or thread level granularity), as individual threads retire.

  2. This is unpublished, and in fact the CUDA programming model provides no statements or guarantees about the order of thread execution, or the order of block execution (except those which may be imposed explicitly by the programmer e.g. through the use of execution barriers, etc.) Trying to answer this question is not a good idea IMO, and even dangerous if you attempt to depend on observed or stated behavior. You should condition yourself to believe that CUDA places no order on thread or block execution, and it may even vary run-to-run. You should write code that is correct in spite of this.

  3. There is no context change associated with blocks. Blocks become resident, then stay resident until they retire. There is obviously a preamble block launch cost and a postamble block retirement cost, but these are unpublished and the only way to discover them would be careful microbenchmarking. I don’t happen to know all the resources associated with block residency on a SM, but I would expect that the register file is the primary resource, and the block register file exists in the … SM register file. There may be other resources that have a “memory footprint”. The per-thread stack would be an example. I’m not aware of what they all are, pretty sure they are mostly unpublished. However CUDA has a demonstrable GPU memory cost/overhead. There are various other questions, including recent ones here in this forum, discussing stack memory usage, and how to roughly calculate it for a particular kernel.

Pursuant to item 3 above, there is a general optimization strategy which involves tuning the kernel launch configuration (e.g. number of blocks, with an eye towards occupancy, etc.) to match the device you are running on. However this is a “ninja level” strategy and I doubt in most cases it could make more than a few percent difference in performance.

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

Thanks for your answer!

Just a few more clarifications on how blocks are executed:

  1. If a grid-wise synchronization is called, then does the block that hit the synchronization get swapped out for eligible blocks? If so, then is the cost of swapping a stalling block with an eligible block significant?

  2. What is the cost of grid-wise synchronization relative to other synchronization methods?

I assume you are referring to grid-wise synchronization available in cooperative groups.

If you study that carefully, you will see that a proper cooperative grid launch satisfies a requirement that all blocks are SM-resident. There is no swapping of blocks required.

The primary thing that the cooperative launch does is provide a mechanism to guarantee that in such a scenario, all blocks that have not yet met the grid sync point can (will) make forward progress, even when some blocks have met the grid sync point.

Oh okay. Let me study a little more about cooperative groups to understand your answer
Thanks as always

A given GPU has a maximum occupancy limit, or maximum instantaneous thread capacity. This is the number of threads that can actually be resident on SMs for that GPU, so that their instructions can actually be scheduled. (The GPU block scheduler assigns blocks to SMs as their resources permit.)

As an “upper bound”, this is 2048 * # of SMs (for most GPUs; not Turing, however). The SM has a maximum limit of 2048 threads that can be resident, i.e. instruction-schedulable. (In your other question, you are exploring concepts such as resource limitation that indicate that for a particular kernel, the actual thread “capacity” may be lower.)

Whatever this capacity is, it can be known a-priori for a particular kernel and a particular GPU. If you study the programming guide section on cooperative groups:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups

you will see that for a proper cooperative grid launch, at least 2 requirements must be met:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg

  1. The cooperative launch property must be true:

“You should also ensure the device supports the cooperative launch property, …”

  1. The size of the grid requested in the launch has to meet some limits, which vary by GPU and kernel. In order to programmatically determine the limits, the occupancy API can be used:

"To guarantee co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered. "

This second requirement, when properly met, means that all blocks can and will be co-resident, for a proper launch. This co-residency means that there need be no “swapping” of blocks in order to allow forward progress of all threads in the grid.

Thanks a bunch :) It really helped.

A follow-up on your answer #2: “This is unpublished, and in fact the CUDA programming model provides no statements or guarantees about the order of thread execution, or the order of block execution”

In Wavenet implementation published by Nvidia (https://github.com/NVIDIA/nv-wavenet), if you take a look at function

launch_manyblock

function inside “nv_wavenet_persistent.cuh” file, you will soon see that a huge number of blocks is allocated ( assuming num_layers = 12, R = 64, S = 256, A = 256, for each sample, we allocate 45 blocks per sample. Since each run produces hundreds of thousands of samples, we are clearly allocating more blocks than there can be present on any GPU ).

However, the blocks have to be executed in the order of sample and layer ( 1st layer of 1st sample should be completed for 2nd layer of 1st sample to proceed, and 1st sample should be completed for 2nd sample can proceed ). So how is Brian Pharris ( the author of the above wavenet and principal architect at Nvidia ) keeping order of the block’s launch?

Inside the same file, you will see that: A layer waits for the previous layer to go to completion by running an inifnite while-loop and keep checking the global memory that the previous layer should have produced. A sample waits for previous sample using mutex and __syncthreads(). And there is some PTX code which I do not understand all the details about: “barrier.sync”.

If this requires too much knowledge about the program, then I will wait for him to respond to an issue I am planning on posting. It’s just that he takes weeks to respond to issues, since there are so many of them.

Thanks

To first order, any “performance cost” associated with thread blocks is entirely negligible. It’s not something CUDA programmers typically think about. You simply create a grid with as many thread blocks as are convenient or necessary for your data-to-thread mapping. You can start with a rule of thumb that a thread block should comprise a multiple of 32 threads between 128 to 256 threads, and take the two-level decomposition process from there.

In most CUDA programs, grids comprise many more blocks than can be active run on the device at any given time. GPU scheduling hardware takes care of scheduling the blocks in the grid for execution on the GPU until they all have been processed. As pointed out previously in this thread, the order in which the GPU hardware will schedule the blocks should be considered indeterminate: You fire off a kernel with a grid of blocks, and at some future time all the blocks will have been executed (provided each block terminates in finite time). There is no inter-block synchronization in the basic CUDA programming model.

On many older GPUs it was beneficial to “oversubscribe” execution resources by at least a factor of twenty if the kernel was memory intensive. In other words, for optimal performance, one wanted the number of blocks in a grid to be at minimum > 20x the number of thread blocks that could be simultaneously run on the GPU at any given time. Not sure whether this desired lower bound still applies to the most recent GPUs.

I sense that this thread stems from some confusion as to how code executes on the GPU. It would probably be a good idea to re-read the sections of the CUDA documentation covering the execution model.

Hi njuffa,

I don’t think I have made my confusion clear to you.

From the first answer by Crovella, I learned 2 things:

  1. the order in which blocks are executed is indeterminable ( as you have also pointed out )
  2. unless cooperative launch is used, a thread block will only swapout with another block once all the threads in the block has run to completion

However, in the above github repository, there is a program that launches hundreds of thousands of blocks. And the blocks must run sequentially ( due to the nature of the sequential wavenet ). This means that most blocks wait for the previous block to be finished ( the waiting is accomplished by continuously checking if certain regions in the global memory has been modified to a valid value. the block waits indefinitely in a while loop until the value is set to a valid one ). So, assuming the above two things, what happens when block 50, 51, 52, … ( I am giving arbitrary block index to demonstrate ) are scheduled, but they are all waiting for block 1 to go to completion and produce valid values in the global memory? This shouldn’t work because the blocks 50, 51 … will never run to completion and thus never retire. If they don’t retire, block 1 will never be on the device, and the program hangs.

However, as you may have predicted, the program in the github works perfectly fine ( in fact it seems to have been written by an engineer at Nvidia ). So what is going on?

This seems to be an assumption / assertion on your part which may or may not hold. I certainly do not know whether that is indeed so. I am not familiar with the code, and frankly I have no interest in finding out how that code works.

I can only hypothesize. Two possibilities, in order of perceived likelihood:

(1) You are misunderstanding how the posted code works
(2) The NVIDIA engineer who wrote it made use of a ninja programming technique that is in a grey area as far as the CUDA programming model goes.

okay. I will have to look into it more carefully. Thanks for your honest thoughts

I never said that, and its not true. To repeat myself, apart from situations like debugging, or cuda dynamic parallelism, my expectation is that a threadblock, once scheduled, always runs to completion. It is never swapped out.

The use, or not, of a cooperative launch doesn’t change that.

Having said this, the CUDA programming model provides no stated guarantees that this will happen or won’t happen (i.e. whether a block stays resident always, or could be swapped out in some cases). But apart from the use of something like cooperative groups, any implied dependency on either scenario in your code is a programming error. (Any expectation that a threadblock will run to completion before being swapped out, to facilitate some need in your code, is an error. Any expectation that a threadblock will be swapped out, to facilitate some need in your code, is an error.)

One methodology to guarantee the order of threadblock execution is to dynamically assign the block ID. Normally we treat blockIdx as the block ID. However it’s possible to ignore that and assign a block ID in a first-come first-served fashion using atomics. This guarantees that the first N resident blocks will have block ID’s from 0 to N-1. This can be used to overcome the lack of block execution order. I don’t know if that is what is happening in Brian’s code. It wasn’t obvious to me that is the case, but after 10 minutes of code study I still hadn’t reached an answer.

Later:
While it doesn’t use atomics per se, it does appear that the key kernel launch (manyblock instead of persistent) does indeed use a dynamic block assignment strategy:

int block_idx = PERSISTENT ? blockIdx.x : blockIdx.x % params.blocks_per_sample;

When PERSISTENT is false, it is a many block launch (the type of launch you are asking about). That type of launch is doing block reassignment. While I haven’t completely parsed it, I think it is possible that this kind of block reassignment can be used to prevent deadlock, and probably does in Brian’s case.