Block size and occupancy

I have a vast number of blocks, enough to saturate the hardware, but for algorithmic reasons the preferred number of threads per block is 32. This also nicely matches the hardware warp size of 32 threads. Each thread works with 4 bytes of data, giving coalesced memory accesses of 32 * 4 = 128 bytes per warp. I thought this design was optimal, but the occupancy calculator reports 50% thread occupancy for a compute capability 8.9 device.
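For concreteness, here is a minimal sketch of the pattern I mean (the kernel name and body are placeholders for the real per-element work):

```cpp
// 32-thread blocks; each thread touches one 4-byte element, so each warp
// reads 32 * 4 = 128 contiguous bytes: one fully coalesced transaction.
__global__ void process(const float* __restrict__ in, float* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // placeholder for the real work
}
// launch: process<<<numBlocks, 32>>>(d_in, d_out);
```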

So, where does the remaining 50% go at the HW level? That would suggest the warp size is 64 and the global memory bus width is 256 bytes, while each is half of that, AFAIK. What is the tool really trying to tell me?

EDIT: in other words, my assumption is that the number of blocks per SM is way beyond its HW saturation point, and there is enough shared memory capacity for multiple block instances to be simultaneously active. That would imply that the SM would be equally happy working with 32-entry warps originating from various blocks instead of 32-entry warps from a single wider thread block. The memory bus width is saturated and all 32 SIMT units have something to do, i.e. 100% occupancy. But the calculator says it’s 50%, which escapes me.

The maximum number of resident blocks on an SM for cc 8.9 is 24. With 32-thread blocks, that caps you at 24 * 32 = 768 resident threads, i.e. 768/1536 = 50% occupancy. You would need 48 blocks of 32 threads each to hit the maximum thread complement of 1536.

You’re running into a hardware limit that prevents full occupancy with the choice of 32 threads per block. Of the things you have under your control, one option to fix the report is to move to 64 threads per block, or higher. This does not indicate that the warp size is actually 64; it’s just another hardware limit of the device.
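If you want to see this limit programmatically rather than through the calculator, the occupancy API reports it; a minimal sketch (the kernel here is just a placeholder):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void myKernel(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    for (int blockSize : {32, 64, 128}) {
        int blocksPerSM = 0;
        // Maximum resident blocks per SM for this kernel at this block size
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        int threadsPerSM = blocksPerSM * blockSize;
        printf("block size %3d: %2d blocks/SM, %4d threads/SM, %.0f%% occupancy\n",
               blockSize, blocksPerSM, threadsPerSM,
               100.0 * threadsPerSM / prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

On a cc 8.9 device, a block size of 32 should show 24 blocks/SM and 768 threads/SM, i.e. the 50% figure the calculator reported.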

This (50% occupancy) may or may not matter for performance. Higher occupancy is somewhat correlated with higher performance in a GPU, but that is a very general statement, and not true in every case.

Hello Robert,

Thank you for your answer. I think I am missing the details of the mapping between thread blocks and warps at the lowest level. I presumed that thread blocks are just chopped into warps, and the warps are scheduled for SIMT execution when all their data dependencies are resolved. It is clear that other capacity restrictions apply (registers, shared memory, the 24 resident block limit you mentioned, and the 128-byte aggregated global bus width). Still, the SM should otherwise be flooded with warps. In this sense, the HW is already 100% utilised in terms of RAM bandwidth and ALU assignments; there should be no performance gain from adding more threads to a block, which is in line with what you wrote.

So, what hardware capacity is underutilized? I can (reluctantly) bump the block size to 64, but why would I want to do it in the first place?

The GPU generally likes to have full occupancy. One of the most important reasons for this is due to the idea of latency hiding. You can find many descriptions of this in various forum write-ups, and it is covered in an organized fashion in unit 3 of this online tutorial series.

Briefly, just because the SM has, say, 24 warps does not mean in all cases (i.e. in every clock cycle) that it can find a warp that is ready to issue a new instruction. But the SM (or SMSP) is operating at its highest throughput when it can issue an instruction in every clock cycle. Warps can be stalled for various reasons, the most common being a wait for a dependency to be satisfied before execution of that warp’s threads can continue.

If you have 24 warps (assigned to an SM), then you have up to 24 warps to choose from. If you have 48 warps, then you have 48 warps to choose from. (The details are more complicated than this, because a cc 8.9 SM is really broken into 4 SMSPs, but for a general understanding of the idea of latency hiding, we can consider things at an aggregate level.)

If you have 48 warps to choose from, then in some cases it will be “less likely” (as compared to the 24-warp case) that in a given clock cycle there are no “eligible warps”. In that case, average instruction throughput could be higher.
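As a rough back-of-the-envelope (the latency and work figures below are illustrative assumptions, not measured values): suppose each warp stalls for ~400 cycles on a memory dependency after every ~40 cycles of independent work, and each of the SM’s 4 schedulers can issue one instruction per cycle. Then, very loosely:

```
warps needed per scheduler ≈ stall latency / independent work per warp
                           ≈ 400 / 40 = 10
```

With 32-thread blocks at the 24-block cap you have 24 warps per SM, i.e. 6 per scheduler; at full occupancy you have 48, i.e. 12 per scheduler, which in this illustrative scenario would be enough to keep issuing.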

Again, this is not a blanket guarantee. Higher occupancy is somewhat correlated to higher performance, but it is not 100% correlated. I can’t say whether a block size of 64 would actually help your code, from a performance perspective.

That isn’t the definition of occupancy, at least not the way NVIDIA tools use the word. Occupancy is the number of threads actually resident on an SM compared to the number of threads that could be resident on an SM. Even this has a couple of different ways to look at it, but neither corresponds to your statement.

Your statement (“all the 32 SIMT units have something to do”) is closer to the notion of utilization, as presented by Nsight Compute, e.g. in the SOL report section. The two bars that Nsight Compute presents in the bar chart in that section refer to SM utilization and memory utilization. The SM utilization is roughly similar, in my view, to your statement, although we could descend into another discussion there, because a GPU SM does not consist of or contain “32 SIMT units”. Nevertheless, the notion you have expressed is, in my view, related to utilization. And latency hiding is closely connected to utilization, and is reflected in the SM utilization reported by Nsight Compute.

Utilization is considered in aggregate, but taking into account cycle-by-cycle behavior. That is, if you can issue an instruction in the SM 50% of the time, then the utilization will be reported at approximately 50%. And yes, we could dissect that statement as well.

But the point I want to make is, if you know for certain that your utilization is at 100% (that is, in every clock cycle, in every SMSP of every SM, there is at least 1 eligible warp), then in my view simply increasing occupancy is unlikely to result in significant performance benefit. The most obvious path for increased occupancy to result in increased performance is if there is a corresponding increase in SM utilization (or perhaps memory utilization).

I am very familiar with latency hiding techniques, albeit this knowledge doesn’t come from the GPU realm. So, it is all just about the capacity of the executable warp pool. The warp pool can be filled either by chopping long thread blocks into warps, or by drawing on the resident block set if there is nothing to chop (blocks of <= 32 threads). The hard limit for the latter is 24 blocks, so the calculator merely says “I could manage twice as many warps as I could extract from your blocks”. Indeed, it is not something to worry about in this application, since the global memory is already saturated.

All is crystal clear now, I very much appreciate your explanation.

That’s a useful observation to make. The GPU has many potential bottlenecks, or pathways that could be limiters to performance. If you are “saturating” the memory bus (e.g. the memory utilization bar in the Nsight Compute SOL report is at or above, say, 75%), then additional occupancy is unlikely to improve performance very much, in my view/experience. The usual suggestion in that case is to somehow “optimize” your utilization of memory: e.g. look for more opportunities to coalesce traffic, look for ways to reuse data in shared memory or caches, etc.

Conventional wisdom (and what will be reported in one of the Nsight Compute informational messages, in some cases) is that increasing occupancy to improve latency hiding is most directly called for if both memory utilization and SM utilization are “low” (let’s say each less than 50%), a condition sometimes referred to as “latency bound”, and occupancy is not already 100%.

Yes, it appears I confused utilization with occupancy, and utilization is all I should care about at the end of the day. So, this issue is resolved.

But I do want to understand what you mean here:

Please could we descend? Is there a non-NDA paper explaining what really happens under the hood at the hardware level for, say, an Ada SM, with all the CUDA abstraction layers removed? I know FPGAs, so bare wires would be the most telling thing to me.

Foundational information can be found both in the programming guide as well as the profiler guide I already linked.

There are other resources as well, such as the GPU whitepapers (perhaps going back to some of the earliest published ones, such as Fermi), 3rd-party microbenchmarking papers, the tutorial series I already linked, and various forum posts that discuss these topics.

Sorry, I don’t know of anything like that, i.e. anything that discusses behavior or implementation at the HDL or RTL level, or anywhere below the CUDA abstraction.

I think Robert explained everything quite well and you probably understood everything perfectly, but I would emphasize a few important points:

How to have a block size of 64 with an algorithm for a block size of 32

You can always keep your exact algorithm and just make your block size larger. Just treat the extra 32 threads like a separate block: If threadIdx.x is the thread number from 0…31, then introduce a threadIdx.y from 0…1 and treat it like a blockIdx.w (i.e. something below blockIdx.x). Then your block size is 64, but your algorithm still uses 32 threads on the lowest level.

In this case, you have to adapt a few things: shared memory arrays need another index, distinguishing threadIdx.y == 0 from threadIdx.y == 1, to keep two separate shared memory storage areas. And for performance: __syncthreads() would synchronize all 64 threads; with PTX inline numbered barriers you can synchronize fewer warps separately, and in your case of 32 threads you can just use __syncwarp().
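A minimal sketch of what this could look like (all names are placeholders, and the trivial copy stands in for the real algorithm):

```cpp
// Launch with dim3 block(32, 2): two 32-thread "virtual blocks" per real block.
__global__ void kernel(const float* in, float* out)
{
    // threadIdx.y acts as an extra, lowest-order block index.
    int vblock = blockIdx.x * blockDim.y + threadIdx.y;
    int lane   = threadIdx.x;          // 0..31, the algorithm's thread id

    // One shared memory slice per virtual block, so the halves don't interfere.
    __shared__ float buf[2][32];

    buf[threadIdx.y][lane] = in[vblock * 32 + lane];
    __syncwarp();  // synchronizes only this 32-thread group, not all 64 threads

    out[vblock * 32 + lane] = buf[threadIdx.y][lane];  // stand-in for the real work
}
```

Since threadIdx.x varies fastest, the threads with threadIdx.y == 0 form one warp and those with threadIdx.y == 1 form the other, so __syncwarp() lines up exactly with the 32-thread groups.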

Reasons for a maximum number of resident blocks

As stated, even if a block is separated into warps and the scheduling is done at warp level, the SM still has to track which block a warp belongs to, e.g. for block-wide synchronization. To make those features fast, they have to be implemented in hardware, which takes up die area on the GPU. As it is easy to use larger blocks with an algorithm designed for a smaller block size, 24 resident blocks is a good compromise.

I think that is true, but may not always address the occupancy issue raised in this thread.

For example, if you had “unusually large” shared memory usage. To take an extreme example, if you required e.g. 96KB of shared memory (cc 8.9) for each 32 threads, then you could not simply double the thread complement. For an example that corresponds somewhat to OP’s described case, if the shared usage were 99KB/24 = ~4KB per block (i.e. per warp), then you could not simply double the block size and expect to double the occupancy.
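To make that arithmetic concrete (using the same illustrative numbers):

```
32-thread blocks @ ~4 KB smem each: 24 blocks fit      -> 24 * 32 = 768 threads (50%)
64-thread blocks @ ~8 KB smem each: ~99 KB / 8 KB ≈ 12 -> 12 * 64 = 768 threads (still 50%)
```

Doubling the block size doubles the per-block shared memory, halves the resident block count, and leaves occupancy unchanged.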

This suggestion is just brilliant, that’s exactly what I need and can do without tampering with the algorithmic part of the problem! Thanks a lot to both of you. I love this forum.

With the 2D block trick and vector loads I get 97.11% of the DRAM throughput according to Nsight/SOL, which is quite pleasing to the eye.
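For reference, by vector loads I mean 128-bit accesses along these lines (a sketch, not my exact code):

```cpp
// Each thread moves 16 bytes per access (float4), so a 32-thread warp
// covers 512 contiguous bytes per load instruction instead of 128.
__global__ void copy4(const float4* __restrict__ in, float4* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}
```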

