CUDA Occupancy Calculator: helps pick an optimal thread block size

I’m also interested in an explanation of lmem. Is there a limit on its size? Thanks.

lmem refers to local memory. Local variables are usually mapped to registers, but when you use arrays indexed by variables (instead of constants), the compiler allocates the array in local memory, which is dead slow. It will KILL your performance. You should consider a shared-memory implementation instead.
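To illustrate, here is a hypothetical pair of kernels (not from the original poster’s code): the first indexes a per-thread array with a runtime value, which typically forces the array into local memory; the second keeps the same table in shared memory instead.

// Hypothetical example: "lut" is indexed with a runtime value, so the compiler
// cannot keep it in registers and typically places it in local memory (off chip).
__global__ void lmem_version(const int *in, int *out)
{
    int lut[16];
    for (int i = 0; i < 16; ++i)
        lut[i] = i * i;
    int idx = in[threadIdx.x] & 15;      // runtime index into the array
    out[threadIdx.x] = lut[idx];
}

// The same table kept in shared memory (on chip, much faster).
__global__ void smem_version(const int *in, int *out)
{
    __shared__ int lut[16];
    if (threadIdx.x < 16)
        lut[threadIdx.x] = threadIdx.x * threadIdx.x;
    __syncthreads();
    int idx = in[threadIdx.x] & 15;
    out[threadIdx.x] = lut[idx];
}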

Also note that the number of registers reported in the CUBIN file does NOT directly correspond to the number of registers used by your program. Nobody outside NVIDIA knows exactly how it is calculated; it is NVIDIA internal. So don’t worry about lmem affecting the CUDA occupancy.

CUDA occupancy is defined as the ratio of the number of active warps per multiprocessor to the maximum number of active warps.

Now, with 32 threads and 32 registers per thread, you get 1024 registers per block. Thus you could schedule a maximum of 8 blocks per multiprocessor (an MP has only 8192 registers). So, if your shared memory usage per block is <= 2K, you can reach a maximum of 8 blocks per MP, and since each 32-thread block is exactly one warp, that is 8 warps per MP.

8/24 (the maximum is 24 warps) = 0.3333… = 33%. So I don’t see any discrepancy with the CUDA occupancy calculator here.
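Just to spell the arithmetic out, here is a rough sketch of that calculation as host code (the constants are the compute 1.0/1.1 limits quoted above; this mirrors the simple arithmetic in this post and ignores the register-allocation granularity the calculator itself applies, which comes up later in the thread):

#include <stdio.h>

int main(void)
{
    const int threads_per_block = 32;
    const int regs_per_thread   = 32;
    const int regs_per_mp       = 8192;  /* compute 1.0/1.1 */
    const int max_warps_per_mp  = 24;    /* compute 1.0/1.1 */

    int regs_per_block  = threads_per_block * regs_per_thread;  /* 1024 */
    int blocks_per_mp   = regs_per_mp / regs_per_block;         /* 8 (also the hardware cap) */
    int warps_per_block = threads_per_block / 32;               /* 1 */
    int active_warps    = blocks_per_mp * warps_per_block;      /* 8 */

    printf("occupancy = %d/%d = %.0f%%\n", active_warps, max_warps_per_mp,
           100.0 * active_warps / max_warps_per_mp);            /* 33% */
    return 0;
}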

You have not posted your shared memory usage, so my guess is that your shared memory usage is higher, which forces the number of active blocks (and therefore warps) down.

Let us know if you still see a problem with the CUDA occupancy calculator.

Hi, I’d like to ask a question about the interpretation of the results of the occupancy calculator. I have, for example, a kernel that uses a lot of registers and therefore only get 33% occupancy according to the calculator, but what does this actually mean? It doesn’t mean that my kernel is only using 33% of the computational potential a multiprocessor has, does it? How should this result actually be thought of?
The documentation points out that high occupancy helps with hiding latency in memory fetches as well as when syncing threads. Reading this makes me think that high occupancy raises the probability of reaching maximal instruction throughput (the scheduler can run other threads while some are waiting for memory fetches to complete). But is there then a theoretical lower limit on the occupancy at which it is still possible to reach maximum throughput, provided that memory is never read or written? What would this limit be?
The reason I’m interested in this is that I would like at least a rough estimate of the kind of performance increase that is possible in my kernel if I really start optimizing register usage. Right now the kernel has quite a lot of both memory fetches (from global memory, all coalesced reads of course) and floating point operations (a lot of multiplications and additions, which should result in a lot of nice multiply-adds; I haven’t really dug into the PTX yet though) and a few writes (to global memory, coalesced of course). The reads and the floating point ops are quite interleaved, so it should be fairly easy for the compiler to hide the latencies even at the instruction level, but I would like to know how my lowish occupancy affects this.
What is a good occupancy level anyway? Is 100% the only option if one wants high performance? I mean that I run out of registers quite quickly if I try to reach 100% (I think to reach it, one can use at most something like 10 registers per thread).
Another option would be to allow more spilling to local memory, and I’d like to know what the consequences of this are. Of course it introduces more memory accesses, but if the latencies of these can be completely hidden by higher occupancy, then what is the harm?

PS. My kernel has 256 threads per block, uses 32 registers, and its shared memory usage is something like 4 KB.
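(Using the same back-of-the-envelope arithmetic as in the earlier reply, and assuming a compute 1.0/1.1 part with 8192 registers, 16 KB of shared memory and a 24-warp maximum per multiprocessor: 256 threads × 32 registers = 8192 registers per block, so only one block fits on a multiprocessor; one 256-thread block is 8 warps, and 8/24 ≈ 33%, which matches the calculator’s figure. The 4 KB of shared memory is not the limiter here, since 16 KB / 4 KB would allow 4 blocks.)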

Lower occupancy does NOT necessarily mean “bad performance”. It is documented toward the bottom of the “HELP” sheet of the XLS.

I’ll cut and paste it here for you:
"
Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the wall clock time of the kernel execution. For bandwidth bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.
"

Dear Mark,

can you explain why, in the CUDA Occupancy Calculator, the value B34 (MyRegsPerBlock) results from rounding MyWarpsPerBlock*2 up to a multiple of 4 and then multiplying by 16 and by MyRegCount?
Why not just MyThreadCount * MyRegCount?

Davide.

I am having trouble with the -cubin option. I cannot seem to get the file generated. I have the .bat file in my project directory. Here is my .bat file:

nvcc -ccbin "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -keep -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" %1

I run the .bat file using my project’s .cu file called template.cu:

run_nvcc.bat template.cu

I see the compilation results and a bunch of files are created in my project directory, but no .cubin file.

I appreciate any help. Thanks.
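In case it helps while chasing this down: nvcc can also print per-kernel register and shared memory usage directly, without going through the .cubin, by passing the verbose flag through to ptxas, e.g.

nvcc --ptxas-options=-v template.cu

which emits lines along the lines of “ptxas info : Used N registers, M bytes smem” that can be plugged straight into the occupancy calculator. The exact wording of the output varies between toolkit versions.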

I’m using a G260, and was trying to work out how to calculate occupancy for kernels using purely double precision. At the moment I’ve got:

Multiprocessors per GPU: 24
Threads / Warp: 32
Warps / Multiprocessor: 32
Threads / Multiprocessor: 1024
Thread Blocks / Multiprocessor: 1
Total # of 32-bit registers / Multiprocessor: 16384
Shared Memory / Multiprocessor (bytes): 16384

Is this correct? The results I get don’t correspond to what the profiler reports, which leads me to think that I’m wrong here, but I can’t seem to work out anything that does tally and also makes sense!

You can run more blocks at the same time per MP, at least 8, like on G80. I have an adjusted Excel sheet at work, so I cannot check at this time.

What difference does double precision make then? I’m confused…

It makes no difference at all (you will only use more registers when using double precision, but that is reported in the cubin or when running nvcc).
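To illustrate with a hypothetical pair of kernels: on GT200 a double occupies two 32-bit registers, so the double-precision variant of an otherwise identical kernel will typically report a higher register count in the cubin / nvcc output.

// Same kernel body in single and double precision; the double variant
// generally needs more registers to hold its 64-bit values.
__global__ void saxpy_f(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

__global__ void saxpy_d(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}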

Hi Mark,

Do you have the specs for the Quadro FX 5600 and GeForce 8600 GS?

Thank you.

Casy

You can find out on the NVIDIA page which architecture those are; then just select the right architecture in the occupancy calculator.

but isn’t G200 a different architecture? Or is it G84?

The Quadro FX 5600 & GeForce 8600 GS are not GT200.
I saw a version of the occupancy calculator at NVISION on the screen of Brent Oster, so maybe mail boster@nvidia.com to ask him to post the latest version to the forums. Isn’t the latest version shipping with CUDA 2.0, btw?

Hello!

I have a question about the formula for the number of registers per block.

CEILING(MyWarpsPerBlock*2; 4)*16*MyRegCount

If I have a kernel which has one warp per block, would the kernel require the same amount of registers if I use two warps instead?

If this is the case, why?

Thanks!

Moritz
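For what it’s worth, here is a small sketch of what that spreadsheet formula works out to (plain C; the helper name is made up, and CEILING(x; m) rounds x up to a multiple of m):

#include <stdio.h>

/* Mirrors CEILING(MyWarpsPerBlock*2; 4) * 16 * MyRegCount from the calculator. */
static int regs_per_block(int warps_per_block, int reg_count)
{
    int x = warps_per_block * 2;
    int rounded = ((x + 3) / 4) * 4;        /* CEILING(x; 4) */
    return rounded * 16 * reg_count;
}

int main(void)
{
    /* With, say, 10 registers per thread: one warp and two warps round up to
       the same allocation, because 1*2=2 and 2*2=4 both become 4. */
    printf("%d\n", regs_per_block(1, 10));  /* 640 */
    printf("%d\n", regs_per_block(2, 10));  /* 640 */
    printf("%d\n", regs_per_block(3, 10));  /* 1280 */
    return 0;
}

Read this way, the formula charges registers in chunks of two warps (64 threads), which would explain why a one-warp block and a two-warp block end up with the same allocation.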

Local memory is not the same as registers; it lives in the same off-chip GPU memory as global memory, which is slow.

This is my batch file

nvcc -ccbin "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\Program Files\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" %1

This is my cmd line

runnvcccubin.bat cuda.cu

This is my error

nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

Would anyone mind helping me out?

So what is the best way to reduce register usage? Using ‘const’ device function arguments? Is there a best practices guide?
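(Not a definitive answer, but one knob that ties directly into the spilling trade-off discussed above is nvcc’s register cap, e.g.

nvcc --maxrregcount=32 mykernel.cu

where the file name is just a placeholder. Anything that does not fit under the cap spills to local memory, so whether it actually helps is exactly the occupancy-versus-spilling question from earlier in the thread; treat it as something to experiment with rather than a best practice.)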

I am just getting started and reviewing material at this point so please bear with me.

Is it possible to:

  1. Convert the Occupancy Calculator to a CUDA API which can query the architecture and generate CSV files which can then be viewed in Excel?
  2. Does the CUDA API provide queries for each of these parameters to do so? (See the sketch after this list.)
  3. Would this provide (if done) the ability to query at program initialization/run time and optimize for future hardware (as some in this thread desire)?
  4. Is knowing the hardware layout/parameters something that NVIDIA means to expose to the developer, or is it something that should ultimately be transparent to the developer?
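Regarding point 2, here is a minimal sketch of what the runtime API already exposes through cudaGetDeviceProperties (a sketch only; some calculator inputs, such as the maximum number of warps per multiprocessor, are not in this struct and still have to come from the documentation):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       /* properties of device 0 */

    printf("Name:                  %s\n",  prop.name);
    printf("Multiprocessors:       %d\n",  prop.multiProcessorCount);
    printf("Warp size:             %d\n",  prop.warpSize);
    printf("Registers per block:   %d\n",  prop.regsPerBlock);
    printf("Shared mem per block:  %lu\n", (unsigned long)prop.sharedMemPerBlock);
    printf("Max threads per block: %d\n",  prop.maxThreadsPerBlock);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    return 0;
}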

Currently I believe there is still a core knowledge of the hardware layout that the developer must have, so I am not sure how making this a runtime API, where the output metrics could be used as variables to maximize occupancy, would fare against future hardware revisions. This hardware design, I believe, is also new and subject to change.

  1. Even if I were to code this into an API myself, would it stand the test of time (future hardware revisions)?
  2. Does anyone think this would be useful?