When trying to understand Nsight and the numerous statistics and figures it produces, I came across this forum post ("Question about threads per block and warps per SM") as well as section 10 of the CUDA performance documentation. They helped clarify some important aspects in my mind, and I summarised my findings into this table.
Feedback and/or corrections would be gratefully received.
Missing additional text for the shared memory column: "The total shared memory used on an SM is the shared memory (bytes) defined in a kernel * the number of blocks put on the SM."
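As a worked illustration of that formula (the numbers below are hypothetical, not taken from the table):

```cpp
// Hypothetical example: total shared memory consumed on one SM.
// shared_per_block is whatever the kernel declares (static + dynamic);
// blocks_on_sm is the number of resident blocks placed on the SM.
constexpr unsigned shared_per_block = 12 * 1024;  // 12 KB per block (assumed)
constexpr unsigned blocks_on_sm     = 5;          // resident blocks (assumed)
constexpr unsigned total_shared     = shared_per_block * blocks_on_sm;  // 60 KB

// If total_shared exceeded the SM's shared-memory limit (e.g. 64 KB on CC 7.5),
// fewer blocks would be made resident, reducing occupancy.
static_assert(total_shared <= 64 * 1024, "would reduce resident blocks / occupancy");
```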
I am not clear what you are trying to document in the latest spreadsheet. There is a lack of documentation on each cell, and the numbers are not correct.
For example,
Line 6 has 1536 (2048). CC 7.5 (Turing) has a maximum of 1024 threads per SM, and CC 8.6/8.9 (GA10x/Ada) has a maximum of 1536 threads per SM. CC 7.0/8.0/9.0 support 2048 threads per SM.
The "Thread blocks put on a SM" column does not limit thread blocks to the maximum thread blocks per SM. This value differs per CC. The number is usually 1/2 the maximum warps per SM, but that is not always the case; for example, CC 8.6 is limited to 16 thread blocks.
The maximum registers per thread is determined per SM sub-partition, not per SM. Each sub-partition has 1/4 of the total number of SM registers. Occupancy is calculated per SM sub-partition, not per SM. The column states max registers per thread cannot exceed 64Kb. For Turing and later there are 65536 registers per SM, divided equally among the 4 SM sub-partitions. The limit per thread is 255 (which rounds up to 256). The allocation granularity is 8 registers/thread = 256 registers/warp. At the bottom it states "Register allocation granularity = 1 warp = 32 registers." This is not correct. The register allocation granularity for most GPUs (CC >= 3.0) is 8 registers/thread == 256 registers/warp.
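A minimal sketch of that rounding rule (the helper name is mine, purely illustrative):

```cpp
// Register allocation granularity on CC >= 3.0: 8 registers/thread,
// i.e. 256 registers per 32-thread warp.
constexpr unsigned REG_GRANULARITY_PER_THREAD = 8;
constexpr unsigned WARP_SIZE = 32;

// Round the compiler-reported registers/thread up to the allocation granularity.
constexpr unsigned allocated_regs_per_thread(unsigned regs_per_thread) {
    return ((regs_per_thread + REG_GRANULARITY_PER_THREAD - 1)
            / REG_GRANULARITY_PER_THREAD) * REG_GRANULARITY_PER_THREAD;
}

// Example: a kernel reported at 37 registers/thread actually allocates 40/thread,
// i.e. 40 * 32 = 1280 registers per warp.
static_assert(allocated_regs_per_thread(37) == 40, "");
static_assert(allocated_regs_per_thread(37) * WARP_SIZE == 1280, "");
```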
The "Kernel's use of shared memory" column differs for each CC. The number of registers has no impact on shared memory. The SM SRAM on CC 7.0-9.0 is divided between untagged SRAM (shared memory) and tagged SRAM for the unified L1/TEX cache.
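As an aside, on CC 7.0+ the preferred split can be hinted per kernel through the carveout attribute. A minimal sketch, assuming a placeholder kernel; the runtime treats the value only as a hint:

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() {}  // placeholder kernel

int main() {
    // Request the largest shared-memory carveout (value is a percentage, 0-100).
    // Named constants such as cudaSharedmemCarveoutMaxShared also exist, if I
    // remember correctly; the driver may still pick a different split.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         100);
    return 0;
}
```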
At the bottom of the sheet it states "Any one column and row not matched results in less than 100% occupancy". Please be careful, as higher occupancy does not correlate with higher performance. In fact, each kernel has a threshold: increasing above a specific warp occupancy threshold can result in lower performance.
If you want to understand the occupancy calculation then I would recommend downloading a recent Nsight Compute and running the occupancy calculator. The calculator has the Physical Limits for the specified compute capability. It also has a drop-down for the Shared Memory Size Config (the shared memory and L1/TEX split). The old XLS is deprecated; if you can find a version that supports > CC 7.0, then I would recommend reviewing the formulas in the XLS.
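If you would rather query this programmatically, the CUDA occupancy API reports the same kind of answer at runtime. A minimal sketch, assuming a placeholder kernel, block size, and no dynamic shared memory:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void myKernel(float *out) { out[threadIdx.x] = 0.0f; }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;      // threads per block (assumed)
    const size_t dynamicSmem = 0;   // dynamic shared memory per block (assumed)

    // How many blocks of this kernel can be resident on one SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    // Theoretical occupancy = resident warps / maximum warps per SM.
    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```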
The latest occupancy calculator spreadsheet preserved by the Internet Archive supports GPUs up to and including compute capability 8.6. Captured in 2022, presumably from CUDA 11.x:
Thank you for this information. This will help a lot to clarify the occupancy resource squeeze across the different CUDA compute capabilities.
In short, the table tries to consolidate the five points of pressure that need to be balanced to maximise occupancy, and to show how shared memory is being used or overflowed (which reduces occupancy). It provides me with a quick look-up and an easy way of calculating the number of registers used and the percentage occupancy. How these five pillars are interrelated seems to me to be spread thinly across different sections in different CUDA documents. This table 'summarises' how they are interconnected and how they can be brought together to aid the calculation of percentage occupancy.
As I learn this, I started out with what my GTX2060 is capable of. Finding the figures is not easy; the occupancy spreadsheet helped here. However, it is not suitable for compute capability 8.6 and later.
I also got confused by the different figures used in various forum examples, trying to understand which are correct or relevant to my situation. When the CUDA documentation on occupancy says things like "… has an occupancy of 75%", or Nsight summarises in a profile "current occupancy of 50%", I want to know how this was calculated. This table allows me to reverse engineer that percentage into its different parts. Maybe this table will not be needed in the future once I better understand Nsight.
I do appreciate your input. I will look at your information and try to incorporate it into the table while trying to keep the table simple but informative. It is my way of clearing the mist as someone else put it.
The maximum number of threads per thread block is 1024 on CC 3.0-9.0. You break this limit in the first two rows with 2048 threads and 1536 threads. For CC 7.0… (rows 7, 6) the number of thread blocks for 2048 and 1536 should be 2 and 2. For CC 8.6/8.9 (row 6) the number of thread blocks for 1536 should be 2.
There is a maximum number of thread blocks per SM.
CC 7.0/… is 32
CC 7.5/8.6 is 16
CC 8.9 is 24
This limit is hit in rows 1, 0. Given this limit is different on CC 8.6 and CC 8.9, you will have to use a different annotation.
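These limits can also be read back from the runtime, which is an easy way to cross-check the table against the device you actually have. A minimal sketch (the maxBlocksPerMultiProcessor field needs CUDA 11.0 or newer, if I remember correctly):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("CC %d.%d\n", prop.major, prop.minor);
    printf("Max threads per SM   : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max blocks per SM    : %d\n", prop.maxBlocksPerMultiProcessor);
    printf("Registers per SM     : %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Warp size            : %d\n", prop.warpSize);
    return 0;
}
```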
For the max registers I'm not clear on your calculation, as CC 7.5 differs from the other CCs; however, all CC 5.0-9.0 have 64K registers per SM. The row continues past 255 registers/thread, which is the limit for all architectures you are documenting. Row 6 for CC 8.6/8.9 was stated to be wrong. There is a second reason: the register allocation granularity is 8 registers/thread, so the count of 42 is not valid. The calculation needs to floor to 40 registers/thread. At 48 registers/thread there would be insufficient registers to have 1536 threads.
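Spelling that calculation out (my arithmetic, using the numbers above):

```cpp
// 64K registers per SM, 1536 resident threads (CC 8.6/8.9 maximum):
//   65536 / 1536 = 42.67 registers/thread of raw budget.
// The allocation granularity is 8 registers/thread, so floor to a multiple of 8:
//   -> 40 registers/thread is the largest usable value.
// Check: 40 * 1536 = 61440 <= 65536  (fits)
//        48 * 1536 = 73728 >  65536  (1536 threads no longer fit)
constexpr unsigned regs_per_sm = 65536, threads = 1536, granularity = 8;
constexpr unsigned max_regs_per_thread = (regs_per_sm / threads) / granularity * granularity;
static_assert(max_regs_per_thread == 40, "");
```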
Thanks Greg.
I have made a new table with your feedback, though I am afraid the table is still incorrect. It is also getting complicated, with caveats as to how to read the table. Maybe it can be cleaned up a bit with further corrections?
Thank you for your further input.