Previously, I was quite keen on using the H800 because of its distributed shared memory. But now it seems that as long as I can confine the computation within a single SM, using a single maximum-sized block should suffice. Why do some algorithms opt for multiple smaller blocks? There is an overhead for launching each block, so wouldn't it be better to use a persistent block? Upon closer thought, it appears that using larger blocks is generally better, especially considering that warps and blocks aren't strictly bound together. Also, since each block gets its own shared memory, using one large block allows a larger shared memory allocation.
Certainly, if you advance that statement with no supporting evidence, nobody can really argue the point. However, we can say that a single block per SM is not sufficient to achieve full theoretical occupancy on an H800. Whether that matters or not cannot be answered in a general way - it depends on the code. But there is a notion that there is a general statistical correlation between higher achieved occupancy and higher performance. It's not a hard and fast rule, just a notion or statistical correlation. However, this does not address the question of why you would use smaller blocks.
One reason that I know of to use smaller blocks in certain cases (let's say 512 vs. 1024) is that certain GPUs do not have a maximum thread complement per SM that is evenly divisible by 1024. Stated another way, you cannot achieve maximum occupancy on e.g. a cc8.6 GPU with blocks of size 1024 - each SM would only be 2/3 full and would not have space for another block. But with a size of 512, you could put 3 such blocks on a cc8.6 SM and achieve full occupancy.
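As an illustration (a minimal sketch, not from the original discussion; the kernel and block sizes are placeholders), you can ask the runtime how many blocks of a given size it will co-schedule on an SM with cudaOccupancyMaxActiveBlocksPerMultiprocessor:

```
#include <cstdio>

// Placeholder kernel; real occupancy also depends on its register/shared memory use.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int sizes[] = {1024, 512, 256, 128};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                      blockSize, 0 /* dynamic smem */);
        printf("block size %4d: %d block(s)/SM -> %d of %d resident threads\n",
               blockSize, blocksPerSM, blocksPerSM * blockSize,
               prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

On a cc8.6 device (1536 threads per SM), this should report only 1 resident block at a block size of 1024 (two thirds of the SM) but 3 resident blocks at a block size of 512.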
Another possible set of reasons is related to the tail effect and other related considerations. The tail effect is also about efficient use of the GPU - keeping it fully occupied for as much of the program duration as possible. The general method to address the tail effect (matching the number and size of waves to the size of the GPU) assumes that the work per thread is fairly predictable and uniform. But what if this is not the case? Let's take an extreme example - each threadblock has one thread that does lots of work, taking a long time, while the other 1023 threads finish their work quickly. That block will not retire until the last thread is finished. To some degree (this varies by GPU architecture and is not well specified, but the general idea is that newer GPUs may behave "better" than older ones in this respect) the block will hold on to SM resources until the last thread is finished. Now, if it is holding onto resources for a 1024-thread block, that may be less advantageous than if it is holding onto resources for a 128-thread block.
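To make that extreme example concrete, here is a hypothetical kernel sketch (the name and the amount of busy-work are made up) in which thread 0 of each block runs much longer than the others:

```
__global__ void imbalancedKernel(float *out, int heavyIters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 1.0f;
    if (threadIdx.x == 0) {
        // One thread per block does lots of work...
        for (int k = 0; k < heavyIters; ++k) v = sinf(v) + 1.0f;
    } else {
        // ...while the rest finish almost immediately.
        v = 2.0f;
    }
    out[i] = v;
    // The block (and the registers/shared memory it reserves) is not retired
    // until thread 0 finishes; a smaller block holds fewer resources hostage.
}
```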
I imagine there may be other considerations; perhaps others will chime in. But when I am teaching CUDA to beginners, I point out that without other considerations/needs, the choice of threadblock size is somewhat arbitrary, and in many cases the choices of e.g. 128, 256, and 512 have little to recommend one over the other. So I don't want to overemphasize the idea that threadblock sizing choices are critical in every case, mysterious, and must be agonized over. In most cases that I have run into, threadblock sizing choices (within appropriate guidelines: not too small, not too large, and a multiple of 32) are not very critical to either behavior or performance.
Another point is that large threadblocks reduce the number of available registers per thread.
In my experience the overhead of launching a block is negligible. Traditional guidance for CUDA programmers is that one should strive to have at least two thread blocks running on an SM concurrently to exploit available hardware resources fully. I am not aware of evidence that invalidates this conventional wisdom on newer GPUs (compute capability >= 8.0).
My standard recommendation (a rule of thumb) for block sizing is to start the design process with a block size of between 128 and 256 threads that is a multiple of 32, and deviate from this only if there is a good reason to do so: sometimes a use case suggests a particular "natural" block size, other times profiler data will indicate how to modify the block size for best performance. The smallest block size I have seen be useful in practice was 64 threads, and the largest useful block size was 1024 threads at the other extreme.
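As a concrete starting point in that spirit (a sketch with a made-up kernel and sizes), a 256-thread block combined with a grid-stride loop decouples correctness from the launch configuration, so the block size can later be tuned without touching the indexing logic:

```
__global__ void scaleKernel(float *x, float a, int n) {
    // Grid-stride loop: correctness does not depend on the launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        x[i] = a * x[i];
    }
}

void launchScale(float *d_x, float a, int n) {
    const int blockSize = 256;                      // rule-of-thumb starting point
    int gridSize = (n + blockSize - 1) / blockSize; // enough blocks to cover n once
    scaleKernel<<<gridSize, blockSize>>>(d_x, a, n);
}
```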
because certain GPUs do not have a maximum thread complement per SM that is whole number divisible by 1024. Stated another way, you cannot achieve maximum occupancy on e.g. a cc8.6 GPU with blocks of size 1024 - each SM would only be 2/3 full, and would not have space for another block. But with a size of 512, you could put 3 such blocks on a cc8.6 SM, and achieve full occupancy.
Why…? Like… using the GPU occupancy calculator for a matmul kernel on cc 8.6, we know the resource usage: 128 registers per thread, shared memory 128*8*2*2*4 = 16384 bytes (16 KB), and 256 threads per block.
So we know:

| Limiting factor | Blocks per SM |
|---|---|
| Limited by Max Warps / Blocks per Multiprocessor | 6 |
| Limited by Registers per Multiprocessor | 2 |
| Limited by Shared Memory per Multiprocessor | 5 |
If we instead use a threadblock with 512 threads, still with 128 registers per thread and 16384*2 = 32768 bytes of shared memory, will this sacrifice active warps? An obvious benefit is that the two halves of the block can share one shared memory allocation, so even the A tile data can be reused.
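To make the comparison explicit, here is a small host-side sketch of the occupancy arithmetic, assuming the cc8.6 per-SM limits (48 warps, 65536 registers, 100 KB shared memory) and assuming a 1 KB per-block shared memory reservation (which would explain the limit of 5 in the table above):

```
#include <stdio.h>

static int minInt(int a, int b) { return a < b ? a : b; }

/* Blocks per SM as limited by warps, registers, and shared memory (cc8.6, assumed). */
static int blocksPerSM(int threads, int regsPerThread, int smemBytes) {
    const int maxWarps = 48, regsPerSM = 65536, smemPerSM = 100 * 1024;
    const int smemReservedPerBlock = 1024;  /* assumed per-block reservation */
    int byWarps = maxWarps / (threads / 32);
    int byRegs  = regsPerSM / (threads * regsPerThread);
    int bySmem  = smemPerSM / (smemBytes + smemReservedPerBlock);
    return minInt(byWarps, minInt(byRegs, bySmem));
}

int main(void) {
    /* 256-thread blocks: limits 6 / 2 / 5 -> 2 blocks = 16 warps resident. */
    printf("256 threads: %d blocks/SM\n", blocksPerSM(256, 128, 16384));
    /* 512-thread blocks with doubled shared memory: limits 3 / 1 / 3 -> 1 block,
       which is also 16 warps resident, so the active warp count is the same. */
    printf("512 threads: %d blocks/SM\n", blocksPerSM(512, 128, 32768));
    return 0;
}
```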
Regarding the occupancy calculator: really? The thread block size will influence the number of registers per thread?
You can see all hardware limits in Section 16.2 of the CUDA C Programming Guide.
Each SM has 65536 registers. For example, these could be used by a single block of 1024 threads with 64 registers per thread, or by two blocks of 1024 threads each with 32 registers per thread.
A single block of 1024 threads with 70 registers per thread would not be possible.
Some architectures allow only 32768 registers per block, i.e. 1024 * 32.
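One way to make a full-sized block fit the register file (a sketch, not part of the discussion above): the __launch_bounds__ qualifier tells the compiler the maximum block size, so it can cap register usage accordingly (here to at most 65536 / 1024 = 64 registers per thread on a 65536-register SM), spilling any excess to local memory:

```
// Telling the compiler a 1024-thread block must be able to be resident limits
// the kernel to at most 64 registers per thread on a 65536-register SM.
__global__ void __launch_bounds__(1024) bigBlockKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
```

A similar effect can be achieved globally with the nvcc option -maxrregcount.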
Regarding the 2/3 occupancy, you can see from the linked table that the maximum number of threads per SM for cc8.6 is 1536. If you were to use a block size of 1024, only a single block can fit on the SM. 512 "thread slots" on the SM would be unused.
Why? If I utilize enough resources, using one block I can get a large enough shared memory allocation…
Yeah, the total thread count is fixed - we want to use a fixed number of threads on one SM, and we are just discussing how many blocks should contain them: does 1 block contain 1024 threads, or do 2 blocks contain 512 threads each?
Different GPU generations provide a different mix of resources. By breaking up work in a fairly fine-grained fashion, but not overly fine-grained, one increases the likelihood that most of available resources can be used, across the four or five GPU generations that matter at any particular point in time.
The guidelines I gave generally provide good results across many different kinds of GPUs and use cases. That does not mean they are necessarily optimal for any one particular scenario. They are a good starting point for a software design, as stated. It is impossible to give hard and fast rules that provide optimal results across all use cases and hardware platforms.
I would encourage an explorative approach that uses different block and grid configurations and evaluates them with the help of the CUDA profiler, ideally on more than one GPU architecture, to get a good feel for how your use case is best mapped to the hardware.
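A minimal sketch of such an exploration (the kernel, sizes, and work are placeholders): time the same kernel with several block sizes using CUDA events, then examine the most promising configurations in the profiler:

```
#include <cstdio>

__global__ void workKernel(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        x[i] = x[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 24;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int sizes[] = {64, 128, 256, 512, 1024};
    for (int bs : sizes) {
        int grid = (n + bs - 1) / bs;
        workKernel<<<grid, bs>>>(d_x, n);   // warm-up launch
        cudaEventRecord(start);
        workKernel<<<grid, bs>>>(d_x, n);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", bs, ms);
    }
    cudaFree(d_x);
    return 0;
}
```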
Because on a cc8.6 GPU there is a hardware limit for the number of threads per SM that is 1536, see here (“Maximum number of resident threads per SM”). I won’t be able to answer the question “why is there a HW limit?” That is how the CUDA GPU designers designed the cc8.6 SM.
This hardware limit doesn’t have anything to do with register usage or any other aspect of your code. I don’t need to refer to the occupancy calculator to determine this.