CUDA Occupancy Calculator: helps pick optimal thread block size

Hello CUDA users,

We’ve just posted a new tool on the CUDA site: the CUDA Occupancy Calculator. This tool is a Microsoft Excel spreadsheet that helps you choose the thread block size for your kernel that achieves the highest occupancy of the GPU. The CUDA Occupancy Calculator can be found on the CUDA homepage.

Here is a direct link: CUDA Occupancy Calculator

Hopefully this will answer your questions about occupancy, register file size, how register and shared memory usage affect efficiency, and how to tune your thread block size. The documentation for this tool follows, and it is also included in the Excel spreadsheet on the “help” tab.

As we release new GPUs with different parameters, we’ll add them to this tool so it can be used for any GPU.

If you have questions about or problems with the CUDA Occupancy Calculator please post them in the forums.

Cheers,
Mark

------------------------------------ Documentation ------------------------------------------

Overview
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage in order to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail.

The size of N on G80 is 8192 32-bit registers per multiprocessor.
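The launch condition above can be sketched as a one-line check. This is only an illustration of the rule as stated in the text; the function name `registers_fit` is made up for this example, and real hardware may round register allocations up, which this check ignores.

```python
# N on G80, as stated in the text above.
G80_REGISTERS_PER_MULTIPROCESSOR = 8192


def registers_fit(regs_per_thread, threads_per_block,
                  n_registers=G80_REGISTERS_PER_MULTIPROCESSOR):
    """Return True if one thread block fits in the register file.

    Implements only the rule from the text: a launch fails when
    registers per thread times thread block size exceeds N.
    """
    return regs_per_thread * threads_per_block <= n_registers


print(registers_fit(16, 512))  # 16 * 512 = 8192 registers, exactly N
print(registers_fit(17, 512))  # 17 * 512 = 8704 registers, more than N
```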

Maximizing occupancy can help to hide latency during global memory loads that are followed by a __syncthreads(). Occupancy is determined by the amount of shared memory and the number of registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. The CUDA Occupancy Calculator can assist in choosing a thread block size based on shared memory and register requirements.
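As a rough illustration of how these resources limit occupancy, here is a simplified model. It is not the spreadsheet's actual formula: it ignores any allocation granularity the hardware applies, and the G80 limits used (8192 registers, 16 KB shared memory, 24 warps, and 8 blocks per multiprocessor) are assumptions taken from figures quoted elsewhere in this thread. All names are illustrative.

```python
WARP_SIZE = 32

# Assumed G80 per-multiprocessor limits (see the note above).
G80 = {
    "registers": 8192,
    "shared_mem_bytes": 16 * 1024,
    "max_warps": 24,
    "max_blocks": 8,
}


def occupancy(threads_per_block, regs_per_thread, smem_per_block, gpu=G80):
    """Estimate occupancy as active warps / maximum warps.

    Simplified: no rounding of register or shared memory allocations.
    """
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # Each resource caps how many blocks can be resident at once.
    limits = [gpu["max_blocks"], gpu["max_warps"] // warps_per_block]
    regs_per_block = regs_per_thread * threads_per_block
    if regs_per_block > 0:
        limits.append(gpu["registers"] // regs_per_block)
    if smem_per_block > 0:
        limits.append(gpu["shared_mem_bytes"] // smem_per_block)

    active_blocks = min(limits)
    return active_blocks * warps_per_block / gpu["max_warps"]


# 256 threads/block, 14 registers/thread, 2048 bytes of shared memory/block:
# registers allow only 2 resident blocks, so 16 of 24 warps are active.
print(f"{occupancy(256, 14, 2048):.0%}")
```

In this model the most restrictive resource decides the number of resident blocks, which is why lowering register usage or changing the block size can raise occupancy.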

Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.
1.) First select your GPU in the green box.

2.) For the kernel you are profiling, enter the following in the orange box: the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block in bytes. See below for how to find the registers used per thread.

3.) Examine the blue box and the graph to the right. These show the occupancy, the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph shows the occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph.

You can now experiment with how different thread block sizes, register counts, and shared memory usages can affect your GPU occupancy.

Determining Registers Per Thread and Shared Memory Per Thread Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the -cubin option to nvcc. This will generate a .cubin file, which you can open in a text editor. Look for the “code” section with your kernel’s name. Within the curly braces ("{ … }") for that code block, you will see a line with “reg = X”, where X is the number of registers used by your kernel. You can also see the amount of shared memory used as “smem = Y”. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you dynamically allocate at run time to get the correct shared memory usage. An example is below:

code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
bincode { … }
const { … }
}

Let’s say “my_kernel” contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage is 24 + 2048 = 2072 bytes. We enter this into the box labeled “shared memory per block (bytes)”, and we enter the number of registers used by my_kernel, 5, in the box labeled “registers per thread”. We then enter our thread block size and the calculator will display the occupancy.
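If you would rather not read the .cubin file by eye, the lookup can be scripted. The sketch below assumes the text layout shown in the example above ("key = value" lines following a "name = ..." line); the helper name `kernel_resources` is hypothetical, and the bincode and const sections are left out of the sample text.

```python
import re

# Sample text in the layout shown above (bincode/const sections omitted).
CUBIN_TEXT = """
code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
}
"""


def kernel_resources(cubin_text, kernel_name):
    """Collect the numeric "key = value" fields listed for kernel_name."""
    fields = {}
    current = None
    for line in cubin_text.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(\S+)\s*$", line)
        if not match:
            continue
        key, value = match.groups()
        if key == "name":
            current = value  # fields that follow belong to this kernel
        elif current == kernel_name and value.isdigit():
            fields[key] = int(value)
    return fields


resources = kernel_resources(CUBIN_TEXT, "my_kernel")
dynamic_shared_mem = 2048  # bytes allocated at run time, from the example
total_shared_mem = resources["smem"] + dynamic_shared_mem

print(resources["reg"])   # registers per thread
print(total_shared_mem)   # value for "shared memory per block (bytes)"
```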

For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda

Hi,

Section 5.1 of the documentation notes that the maximum number of threads per block is 512. In the CUDA Occupancy Calculator this value is 768, but when I tried to enter that value in the Occupancy Calculator, I got an error. So my question is: which of those values is correct?

The maximum number of threads per block is not the same as the maximum number of threads per multiprocessor. With multiple blocks you can get 768 threads on one multiprocessor.
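A small sketch of the distinction between the two limits. It considers only thread counts, assumes a cap of 8 resident blocks per multiprocessor on G80, and ignores register and shared memory limits; the function name is illustrative.

```python
MAX_THREADS_PER_BLOCK = 512           # per-block limit from the documentation
MAX_THREADS_PER_MULTIPROCESSOR = 768  # per-multiprocessor limit in the calculator
MAX_BLOCKS_PER_MULTIPROCESSOR = 8     # assumed G80 limit on resident blocks


def resident_threads(threads_per_block):
    """Return (blocks, threads) resident on one multiprocessor."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("a single block cannot exceed 512 threads")
    blocks = min(MAX_BLOCKS_PER_MULTIPROCESSOR,
                 MAX_THREADS_PER_MULTIPROCESSOR // threads_per_block)
    return blocks, blocks * threads_per_block


print(resident_threads(256))  # three blocks of 256 threads fill all 768 slots
print(resident_threads(512))  # only one 512-thread block fits
```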

Thanks for the answer :)

Hi,

Do you have the specs for the GeForce 8800 GTS version?
I would like to add it to the calculator!

Thank you very much in advance.

GTS is the same as GTX except that it has fewer multiprocessors. Since the occupancy is calculated per multiprocessor, this change doesn’t affect the calculator, so I didn’t add GTS as a different GPU. A G80 is a G80 is a G80 in this regard. :) When future CUDA-supporting GPUs are released, I will add their information and post a new calculator.

Mark

When I enter my kernel’s data in the calculator (192 threads/block, 35 reg/thread, 3872 smem/block) I get 0% occupancy. What does it mean to have 0% occupancy? My kernel seems to work fine and produce correct results though. It is also faster than the CPU implementation and my former GPU implementation (OpenGL/Cg).

-MH

Hi,

Could you please let me know how I could do this in a Windows environment? I am using Visual Studio 2003. I changed the compile option to -cubin instead of -ccbin in the command line in the properties for the .cu file, but it did not work.

Thanks,

Shyam

I made a simple batch file to do this from the command line. The file contains only one line, as follows:

nvcc -ccbin "C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin" -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA SDK 10\NVIDIA CUDA SDK\common\inc" %1

I use it from the command line like this:

runnvcccubin.bat file_name.cu

You may need to change the Visual Studio and NVIDIA SDK directories to make it work in your environment.

-MH

Yes, I see. I thought that, for instance, the total number of registers might change.

thx for the answer.

I have three additional questions:

  1. I have evaluated the performance of the graphics card by comparing the computation time of the CPU and of the NVIDIA GPU using matrixmul and simplecublas. The size of the matrix is variable, so one can clearly see the speedup from using the graphics card for bigger matrices.

However, the computation time of matrixmul is always greater than that of simplecublas. Why is that? (One possible answer might be that the CUBLAS implementations are optimized, but I am not sure.)

  2. If one takes a closer look at the occupancy calculator for matrixmul with threads/block = 256, registers/thread = 14, shared mem/block = 2048, the occupancy of each multiprocessor equals 67%.

As I understand it, there are three main options for increasing performance: changing the thread block dimensions, the shared memory size, or the number of registers. In the case of matrixmul, the number of registers used per thread is the bottleneck.

What options do I have to increase the occupancy, given that I have no influence on the number of registers (the compiler already tries to allow more threads by using fewer registers)?

  3. Although I have not changed any line of matrixmul, the program crashes when the matrix dimension increases to 5120x5120. Memory allocation works fine. I don’t have a clue why the program is crashing :blink: .

Thx in advance for your help.

This post is getting off topic for this thread. I want to keep this thread for discussion of the occupancy calculator tool, so please repost your questions in a new topic.

Thanks,
Mark

Thanks a lot. The batch file worked perfectly.

Shyam

Hi all,

There’s an update to the CUDA Occupancy Calculator on the CUDA webpage. This is version 1.1, and it includes a fix for issues that caused the occupancy to be calculated incorrectly.

It also now has 3 graphs – the block size graph from v1.0, plus graphs of how occupancy will change for the selected block size with varying register and shared memory usage.

Mark

Great! Thanks very much for the new release.

Daniel

Mark,

After using the occupancy calculator I found that I am getting 13% occupancy per multiprocessor, which sounds bad. The kernel is definitely “bandwidth bound”, as you say: much more time is spent fetching data from global memory than in calculations. The nature of my kernel is that none of the threads can really run until all the data is loaded into shared memory. They are set up to load global data coalesced into shared memory as fast as possible, with the fewest instructions, do some calculations, and then write the data back out again in a similar fashion.

Will increasing the occupancy by increasing the number of threads per block give me better performance in this case? What type of memory access patterns benefit from more threads per block?

Mark, I wonder if you could simply list the necessary and sufficient conditions for 100% occupancy for a given kernel. My understanding is:

  1. registers <= 10
  2. threads/block mod 32 == 0
  3. warps/block is a divisor of 24
  4. shared mem/block <= 16 KB * (warps/block) / 24 (any alignment constraint?)
    Constant memory does not come into it, as it is the same for all blocks.

An official confirmation would be helpful, and perhaps it should be in the manual in the G80-specific area. If it is there I have missed it.
Thanks, Eric

ed: and then run N * 16 * 24 / (warps/block) blocks, assuming they all execute for the same time.

Since there are only a few thread counts that satisfy those requirements, maybe we can summarize like this:

Max registers: 10

Threads per block    Max shared mem (bytes)
96                   2048
128                  2730
192                  4096
256                  5461
384                  8192
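The table can be reproduced from the conditions listed above. This sketch additionally assumes the G80 limits of 8 resident blocks per multiprocessor and 512 threads per block, and it ignores any shared memory alignment or allocation granularity (the open question in condition 4). Variable names are illustrative.

```python
WARP_SIZE = 32
MAX_WARPS = 24                 # warps per multiprocessor
MAX_BLOCKS = 8                 # assumed resident-block limit on G80
MAX_THREADS_PER_BLOCK = 512
REGISTERS = 8192
SHARED_MEM_BYTES = 16 * 1024

# With all 24 warps (768 threads) resident, each thread gets at most
# 8192 // 768 registers.
max_regs_per_thread = REGISTERS // (MAX_WARPS * WARP_SIZE)

rows = []
for warps_per_block in range(1, MAX_WARPS + 1):
    if MAX_WARPS % warps_per_block != 0:
        continue  # warps/block must divide 24 to fill the multiprocessor
    blocks = MAX_WARPS // warps_per_block
    threads = warps_per_block * WARP_SIZE
    if blocks > MAX_BLOCKS or threads > MAX_THREADS_PER_BLOCK:
        continue  # not launchable, or too many blocks to be resident
    # Shared memory is split evenly among the resident blocks.
    rows.append((threads, SHARED_MEM_BYTES // blocks))

print("Max registers:", max_regs_per_thread)
for threads, smem in rows:
    print(threads, smem)
```

Block sizes of 32 and 64 threads drop out because they would need 24 or 12 resident blocks, and 768 drops out because it exceeds the per-block thread limit.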

Dear Mark,

there is something which keeps me puzzled. My kernel cubin file tells me that

lmem = 640, smem = 1092, reg = 33

Now I know that the number of registers is the limiting factor for my kernel according to
your Occupancy Calculator. But could you please tell me what impact the lmem = 640 has.
I get this lmem in my code when I allocate arrays like

float4 fvList[40];

I thought lmem (= local memory?) was the same as registers. But it seems otherwise,
because changing that value does not change the “Occupancy”, since it’s never used in
your calculation.

Any explanation would be welcome,
Jake
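For reference, the lmem figure is consistent with the size of that array: a float4 is four 32-bit floats, so 40 of them occupy 640 bytes. Local memory is per-thread storage that lives in device memory rather than in the register file, which would explain why it does not enter the occupancy figure. The arithmetic, with illustrative names:

```python
FLOAT_BYTES = 4                 # one 32-bit float
FLOAT4_BYTES = 4 * FLOAT_BYTES  # float4 holds four floats

array_elements = 40             # float4 fvList[40];
lmem_bytes = array_elements * FLOAT4_BYTES

print(lmem_bytes)  # bytes of local memory per thread for the array
```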

Mark, what about G86? :-)

I have an 8500 GT, which is built on the G86 hardware. Could you please provide the info for this chip?

Many thanks in advance.

Hi Mark,

There was no response to my report of this documentation bug. It also affects the occupancy calculator, which shows 33% occupancy for 32 threads and 32 registers, when in fact I measure 17% occupancy and the profiler shows 0.167 occupancy. Reducing registers to 16 gets you up to 33%.

So is the bug in the 64-bit driver, or is it a hardware restriction that is not correctly documented in the guide and a bug in the occupancy calculator?

The particular configuration of 8 blocks, 32 threads, and 32 registers is a good one on the G80, apart from the fact that it does not work!

Eric
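One hypothesis that happens to match these numbers, offered only as an assumption and not confirmed anywhere in this thread, is that registers are reserved for warps in pairs, so a one-warp block is charged for two warps' worth of registers. The sketch below compares that assumption (`warp_granularity=2`) with the plain "registers per thread times block size" rule; shared memory is left out and all names are illustrative.

```python
WARP_SIZE = 32
REGISTERS = 8192
MAX_WARPS = 24
MAX_BLOCKS = 8  # assumed G80 limit on resident blocks


def occupancy(threads_per_block, regs_per_thread, warp_granularity=1):
    """Occupancy when registers are reserved per group of warps.

    warp_granularity=2 is an ASSUMPTION used to test the hypothesis;
    warp_granularity=1 is the plain rule from the documentation.
    """
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Round the warp count up to a multiple of the granularity.
    charged_warps = -(-warps // warp_granularity) * warp_granularity
    regs_per_block = charged_warps * WARP_SIZE * regs_per_thread
    blocks = min(MAX_BLOCKS, MAX_WARPS // warps, REGISTERS // regs_per_block)
    return blocks * warps / MAX_WARPS


print(round(occupancy(32, 32, warp_granularity=1), 3))  # plain rule
print(round(occupancy(32, 32, warp_granularity=2), 3))  # paired-warp assumption
print(round(occupancy(32, 16, warp_granularity=2), 3))  # 16 registers per thread
```

Under the paired-warp assumption only 4 of the 8 blocks fit, giving 4 of 24 warps, while halving the register count restores all 8 blocks.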