Block size and occupancy

I have a vast number of blocks, enough to saturate the hardware, but for algorithmic reasons the preferred number of threads per block is 32. This also nicely matches the hardware warp size of 32 threads. Each thread works with 4 bytes of data, giving coalesced memory accesses of 32 * 4 = 128 bytes per warp. I thought this design was optimal, but the occupancy calculator reports 50% thread occupancy for a compute capability 8.9 device.
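For concreteness, here is a minimal sketch of the pattern I mean (the kernel name and body are placeholders for the real per-element work):

```cpp
// 32-thread blocks; each thread touches one 4-byte element, so each warp
// reads 32 * 4 = 128 contiguous bytes: one fully coalesced transaction.
__global__ void process(const float* __restrict__ in, float* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // placeholder for the real work
}
// launch: process<<<numBlocks, 32>>>(d_in, d_out);
```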

So, where does the remaining 50% go at the HW level? That would suggest the warp size is 64 and the global memory bus width is 256 bytes, while each is half of that, AFAIK. What is the tool really trying to tell me?

EDIT: in other words, my assumption is that the number of blocks per SM is way beyond its HW saturation point, and there is enough shared memory capacity for multiple block instances to be simultaneously active. That would imply that the SM would be equally happy working with 32-entry warps originating from various blocks instead of 32-entry warps from a single wider thread block. The memory bus width is saturated and all 32 SIMT units have something to do, i.e. 100% occupancy. But the calculator says it’s 50%, which escapes me.

The maximum number of resident blocks on an SM for cc 8.9 is 24. With 32-thread blocks, that caps you at 24 * 32 = 768 resident threads, i.e. 768/1536 = 50% occupancy. You would need 48 blocks of 32 threads each to hit the maximum thread complement of 1536.

You’re running into a hardware limit that prevents full occupancy with the choice of 32 threads per block. Of the things you have under your control, one option to fix the report is to move to 64 threads per block, or higher. This does not indicate that the warp size is actually 64; it’s just another hardware limit of the device.
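If you want to see this limit programmatically rather than through the calculator, the occupancy API reports it; a minimal sketch (the kernel here is just a placeholder):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void myKernel(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    for (int blockSize : {32, 64, 128}) {
        int blocksPerSM = 0;
        // Maximum resident blocks per SM for this kernel at this block size
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        int threadsPerSM = blocksPerSM * blockSize;
        printf("block size %3d: %2d blocks/SM, %4d threads/SM, %.0f%% occupancy\n",
               blockSize, blocksPerSM, threadsPerSM,
               100.0 * threadsPerSM / prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

On a cc 8.9 device, a block size of 32 should show 24 blocks/SM and 768 threads/SM, i.e. the 50% figure the calculator reported.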

This (50% occupancy) may or may not matter for performance. Higher occupancy is somewhat correlated with higher performance in a GPU, but that is a very general statement, and not true in every case.

Hello Robert,

Thank you for your answer. I think I am missing the details of the mapping between thread blocks and warps at the lowest level. I presumed that thread blocks are just chopped into warps, and the warps are scheduled for SIMT execution when all their data dependencies are resolved. It is clear that other capacity restrictions apply (registers, shared memory, the 24 resident block limit you mentioned, and the 128-byte aggregated global bus width). Still, the SM should otherwise be flooded with warps. In this sense, the HW is already 100% utilised in terms of RAM bandwidth and ALU assignments; there should be no performance gain from adding more threads to a block, which is in line with what you wrote.

So, what hardware capacity is underutilized? I can (reluctantly) bump the block size to 64, but why would I want to do it in the first place?

The GPU generally likes to have full occupancy. One of the most important reasons for this is due to the idea of latency hiding. You can find many descriptions of this in various forum write-ups, and it is covered in an organized fashion in unit 3 of this online tutorial series.

Briefly, just because the SM has, say, 24 warps does not mean in all cases (i.e. in every clock cycle) that it can find a warp that is ready to issue a new instruction. But the SM (or SMSP) is operating at its highest throughput when it can issue an instruction in every clock cycle. Warps can be stalled for various reasons, the most common being a wait for a dependency to be satisfied before execution of that warp’s threads can continue.

If you have 24 warps (assigned to an SM), then you have up to 24 warps to choose from. If you have 48 warps, then you have 48 warps to choose from. (The details are more complicated than this, because a cc 8.9 SM is really broken into 4 SMSPs, but for a general understanding of the idea of latency hiding, we can consider things at an aggregate level.)

If you have 48 warps to choose from, then in some cases it will be “less likely” (as compared to the 24-warp case) that in a given clock cycle there are no “eligible warps”. In that case, average instruction throughput could be higher.
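As a rough back-of-the-envelope (the latency and work figures below are illustrative assumptions, not measured values): suppose each warp stalls for ~400 cycles on a memory dependency after every ~40 cycles of independent work, and each of the SM’s 4 schedulers can issue one instruction per cycle. Then, very loosely:

```
warps needed per scheduler ≈ stall latency / independent work per warp
                           ≈ 400 / 40 = 10
```

With 32-thread blocks at the 24-block cap you have 24 warps per SM, i.e. 6 per scheduler; at full occupancy you have 48, i.e. 12 per scheduler, which in this illustrative scenario would be enough to keep issuing.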

Again, this is not a blanket guarantee. Higher occupancy is somewhat correlated to higher performance, but it is not 100% correlated. I can’t say whether a block size of 64 would actually help your code, from a performance perspective.

That isn’t the definition of occupancy, at least not the way NVIDIA tools use the word. Occupancy is the number of threads actually resident on an SM compared to the number of threads that could be resident on an SM. Even this has a couple of different ways to look at it, but neither corresponds to your statement.

Your statement (“all the 32 SIMT units have something to do”) is closer to the notion of utilization, as presented by Nsight Compute, e.g. in the SOL report section. The two bars that Nsight Compute presents in the bar chart in that section refer to SM utilization and memory utilization. The SM utilization is roughly similar, in my view, to your statement, although we could descend into another discussion there, because a GPU SM does not consist of or contain “32 SIMT units”. Nevertheless, the notion you have expressed is, in my view, related to utilization. And latency hiding is closely connected to utilization, and is reflected in the SM utilization reported by Nsight Compute.

Utilization is considered in aggregate, but taking into account cycle-by-cycle behavior. That is, if you can issue an instruction in the SM 50% of the time, then the utilization will be reported at approximately 50%. And yes, we could dissect that statement as well.

But the point I want to make is, if you know for certain that your utilization is at 100% (that is, in every clock cycle, in every SMSP of every SM, there is at least 1 eligible warp), then in my view simply increasing occupancy is unlikely to result in significant performance benefit. The most obvious path for increased occupancy to result in increased performance is if there is a corresponding increase in SM utilization (or perhaps memory utilization).

I am very familiar with latency hiding techniques, albeit this knowledge doesn’t come from the GPU realm. So, it is all just about the capacity of the executable warp pool. The warp pool can be filled either by chopping long thread blocks into warps, or by drawing on the resident block set if there is nothing to chop (blocks of <= 32 threads). The hard limit for the latter is 24 blocks, so the calculator merely says “I could manage twice as many warps as I could extract from your blocks”. Indeed, it is not something to worry about in this application, since the global memory is already saturated.

All is crystal clear now, I very much appreciate your explanation.

That’s a useful observation to make. The GPU has many potential bottlenecks, or pathways that could be limiters to performance. If you are “saturating” the memory bus (e.g. the memory utilization bar in the Nsight Compute SOL report is at or above, say, 75%), then additional occupancy is unlikely to improve performance very much, in my view/experience. The usual suggestion in that case is to somehow “optimize” your utilization of memory: e.g. look for more opportunities to coalesce traffic, look for ways to reuse data in shared memory or caches, etc.

Conventional wisdom (and what will be reported in one of the Nsight Compute informational messages, in some cases) is that increasing occupancy to improve latency hiding is most directly called for if both memory utilization and SM utilization are “low” (let’s say each less than 50%), a condition sometimes referred to as “latency bound”, and occupancy is not already 100%.

Yes, it appears I confused utilization with occupancy, and utilization is all I should care about at the end of the day. So, this issue is resolved.

But I do want to understand what you mean here:

Please could we descend? Is there a non-NDA paper explaining what really happens under the hood at the hardware level for, say, an Ada SM, with all the CUDA abstraction layers removed? I know FPGAs, so bare wires would be the most telling thing to me.

Foundational information can be found both in the programming guide as well as the profiler guide I already linked.

There are other resources as well, such as the GPU whitepapers (perhaps going back to some of the earliest published ones, such as Fermi), 3rd-party microbenchmarking papers, the tutorial series I already linked, and various forum posts that discuss these topics.

Sorry, I don’t know of anything like that, i.e. anything that discusses behavior or implementation at the HDL or RTL level, or anywhere below the CUDA abstraction.

I think Robert explained everything quite well and you probably understood everything perfectly, but I would emphasize a few important points:

How to have a block size of 64 with an algorithm for a block size of 32

You can always keep your exact algorithm and just make your block size larger. Just treat the extra 32 threads like a separate block: If threadIdx.x is the thread number from 0…31, then introduce a threadIdx.y from 0…1 and treat it like a blockIdx.w (i.e. something below blockIdx.x). Then your block size is 64, but your algorithm still uses 32 threads on the lowest level.

In this case, you have to adapt a few things: shared memory arrays need another index, distinguishing threadIdx.y == 0 from threadIdx.y == 1, to keep two separate shared memory storage areas. And for performance: __syncthreads() would synchronize all 64 threads; with PTX inline numbered barriers you can synchronize fewer warps separately, and in your case of 32 threads you can just use __syncwarp().
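A minimal sketch of what this could look like (all names are placeholders, and the trivial copy stands in for the real algorithm):

```cpp
// Launch with dim3 block(32, 2): two 32-thread "virtual blocks" per real block.
__global__ void kernel(const float* in, float* out)
{
    // threadIdx.y acts as an extra, lowest-order block index.
    int vblock = blockIdx.x * blockDim.y + threadIdx.y;
    int lane   = threadIdx.x;          // 0..31, the algorithm's thread id

    // One shared memory slice per virtual block, so the halves don't interfere.
    __shared__ float buf[2][32];

    buf[threadIdx.y][lane] = in[vblock * 32 + lane];
    __syncwarp();  // synchronizes only this 32-thread group, not all 64 threads

    out[vblock * 32 + lane] = buf[threadIdx.y][lane];  // stand-in for the real work
}
```

Since threadIdx.x varies fastest, the threads with threadIdx.y == 0 form one warp and those with threadIdx.y == 1 form the other, so __syncwarp() lines up exactly with the 32-thread groups.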

Reasons for a maximum number of resident blocks

As stated, even if a block is separated into warps and the scheduling is done at warp level, the SM still has to track which block a warp belongs to, e.g. for block-wide synchronization. To make those features fast, they have to be implemented in hardware, which takes up die area on the GPU. As it is easy to use larger blocks with an algorithm designed for a smaller block size, 24 resident blocks is a good compromise.

I think that is true, but may not always address the occupancy issue raised in this thread.

For example, if you had “unusually large” shared memory usage. To take an extreme example, if you required e.g. 96KB of shared memory (cc 8.9) for each 32 threads, then you could not simply double the thread complement. For an example that corresponds somewhat to OP’s described case, if the shared usage were 99KB/24 = ~4KB per block (i.e. per warp), then you could not simply double the block size and expect to double the occupancy.
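To make that arithmetic concrete (using the same illustrative numbers):

```
32-thread blocks @ ~4 KB smem each: 24 blocks fit      -> 24 * 32 = 768 threads (50%)
64-thread blocks @ ~8 KB smem each: ~99 KB / 8 KB ≈ 12 -> 12 * 64 = 768 threads (still 50%)
```

Doubling the block size doubles the per-block shared memory, halves the resident block count, and leaves occupancy unchanged.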

This suggestion is just brilliant, that’s exactly what I need and can do without tampering with the algorithmic part of the problem! Thanks a lot to both of you. I love this forum.

With the 2D block trick and vector loads I get 97.11% of the DRAM throughput according to Nsight/SOL, which is quite pleasing to the eye.
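For reference, by vector loads I mean 128-bit accesses along these lines (a sketch, not my exact code):

```cpp
// Each thread moves 16 bytes per access (float4), so a 32-thread warp
// covers 512 contiguous bytes per load instruction instead of 128.
__global__ void copy4(const float4* __restrict__ in, float4* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}
```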

