CUDA Occupancy Calculator helps pick an optimal thread block size

I’m also interested in this, since the number of registers per thread is currently my only bottleneck for maximizing occupancy, so any best-practice guidelines for reducing register usage would be greatly appreciated.
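
Not a complete answer, but two knobs worth knowing about while you restructure the kernel itself: the nvcc flag -maxrregcount, which caps register usage for the whole compilation unit (anything above the cap spills to local memory), and the __launch_bounds__ qualifier in newer toolkits, which lets ptxas pick a register budget per kernel. A minimal sketch, where the kernel name, body and the actual limits are just placeholders:

// Build-time cap for every kernel in the file:
//   nvcc --ptxas-options=-v --maxrregcount=16 mykernel.cu

// Per-kernel hint: at most 256 threads per block, and ask ptxas to keep the
// register count low enough that at least 2 blocks fit on a multiprocessor.
__global__ void __launch_bounds__(256, 2)
scale_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder body
}

Either way it is worth rechecking the ptxas -v output afterwards, since a hard cap can trade registers for local memory spills (reported as lmem).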

Note: The spammer on this thread has been banned to infinity and beyond.

Hi Mark, I have a query. When I compile my .cu file, the cubin info for that kernel appears in the generated .obj file, and it matches what the -cubin command line option reports, namely that I use 448 bytes of shared memory and 12 registers. But when I compile using “$(CUDA_BIN_PATH)\nvcc.exe” --ptxas-options=-v -ccbin …, the Visual Studio output shows me this:

1>ptxas info : Used 12 registers, 448+444 bytes smem, 8 bytes cmem[1]

So which shared memory value should I take into account when using the CUDA profiler to make a good choice about an optimal thread block size: 448+444, or 448?

Best regards
Lermy

Input data:

Compute Capability: 1.3 (GTX 260)
Threads per block: 416
Registers per thread: 19
Shared memory per block: 5268
no dynamic shared memory

The Calculator shows 81% occupancy (2 blocks per multiprocessor),
however the profiler gives 40.6% (1 block only).

Why the difference?

With 416 threads, you should use fewer than 19 registers to get 81% occupancy. This configuration is very, very familiar to me (so I don’t need the Occupancy Calculator to answer :-) ). What are you doing there?

Ah… I was using an old version of the Calculator, 1.4, while there is already a 1.5. Now I see that I must use only 384 threads with that many registers.

With some unrelated changes, my register usage dropped below 16 though…

Computing SAH costs for a kd-tree.
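
For anyone else who hits the same calculator-vs-profiler discrepancy: below is roughly the per-block arithmetic I believe the 1.5 Calculator does for compute capability 1.3. The granularities (warps rounded up to pairs, registers and shared memory allocated in 512-unit chunks) are my reading of the spreadsheet, so treat this as a sketch rather than gospel.

#include <stdio.h>

static int round_up(int x, int unit) { return ((x + unit - 1) / unit) * unit; }

int main(void)
{
    /* the configuration from the question above */
    const int threads = 416, regs_per_thread = 19, smem_per_block = 5268;
    /* assumed GTX 260 (sm_13) limits per multiprocessor */
    const int regs_per_sm = 16384, smem_per_sm = 16384, warps_per_sm = 32;

    int warps       = round_up(threads, 32) / 32;                        /* 13   */
    int alloc_warps = round_up(warps, 2);                                /* 14: warp granularity of 2 on 1.2/1.3 */
    int regs_block  = round_up(alloc_warps * 32 * regs_per_thread, 512); /* 8704 */
    int smem_block  = round_up(smem_per_block, 512);                     /* 5632 */

    int by_regs  = regs_per_sm / regs_block;   /* 1  <- the limiter */
    int by_smem  = smem_per_sm / smem_block;   /* 2 */
    int by_warps = warps_per_sm / warps;       /* 2 */

    int blocks = by_regs;
    if (by_smem  < blocks) blocks = by_smem;
    if (by_warps < blocks) blocks = by_warps;

    printf("%d block(s), occupancy = %d%%\n",
           blocks, 100 * blocks * warps / warps_per_sm);   /* 1 block, 40% */
    return 0;
}

Registers are the limiter: only one 416-thread block fits, i.e. 13 of 32 warps, which is the profiler’s 40.6%. Dropping to 384 threads (12 warps, 7680 allocated registers per block) or below 19 registers lets a second block fit.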

I would also like to know how to properly handle the plus sign in the verbose ptxas output.

Hi!

Why do you calculate the shared memory demand with

(round_up(SharedMemoryPerBlock/512)*512) ?

Thanks in advance!
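
Not the author, but my understanding is that the rounding models the shared memory allocation granularity: on 1.x hardware a block’s shared memory is reserved in 512-byte chunks (a value I am taking from the Calculator’s device data), so the demand is padded up to the next multiple of 512 before the 16 KB per multiprocessor is divided up. A tiny illustration:

#include <stdio.h>

int main(void)
{
    int requested = 5268;                             /* bytes the kernel declares    */
    int padded    = ((requested + 511) / 512) * 512;  /* 5632 bytes actually reserved */
    printf("blocks limited by smem: %d\n", 16384 / padded);   /* prints 2 */
    return 0;
}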

I know this is an old question, but it might be useful to give an answer anyway (for others who are interested).

Some guidelines and case-studies on register optimization have been discussed in a recently published paper at the ODES workshop. You can find the paper here:

http://www.imec.be/odes/odes-8_proceedings.pdf (starting at the 7th page of the document)

Looks very useful. Will the CUDAVis tool be made available?

I am voting for CUDAVis too. Btw, even some versions of C for general-purpose processors had the register keyword to place a variable in registers, but CUDA C does not have it… With Fermi this should not be a very big issue, but on GT200 people sometimes really struggle.

Almost all modern C/C++ compilers ignore the register keyword.

Yep, I remember some DOS compilers honoring it. But in CUDA C it could be really useful.

His website mentions that he still has to upload it; there is already a page where the download should appear.

Ah, thanks. I found it here: http://www.es.ele.tue.nl/~gpuattue/?nav=tools

Does anyone know how to obtain the Ruby tool?

The tool is indeed available as stated.

However, usability might not be ideal, since you have to install some extensions to the Ruby programming language (readme: http://www.es.ele.tue.nl/~gpuattue/downloa…_0.1-readme.txt ).

Users of OS X will have Ruby pre-installed, and Linux users should not have too much trouble either. However, I do not have any experience with Windows.

Hello Cedric!
I hope you’re open to critique.
During my tests, the compiler with the maxrregcount option specified wins over your optimizations in all cases, including the ones published in your work. Both microbenchmarks and full-scale experiments show that your optimizations achieve worse results than the compiler’s.
I also see that the compiler’s register allocation is buggy in some cases, but the way you try to fix it needs a few more rules than simply “fewer registers is better”.
I’m preparing a publication with my benchmarks. It will be available soon.

Of course I am! However, the point of my work is to show that the compiler lacks a good automatic register allocation technique. The optimizations in my work are there for illustration, not so much to compete with the compiler (using maxrregcount).

Also, let me know when your work is finished. Are you planning on working on an auto-tuner of some sort?

My article, as promised.
Actually yes, I do plan to work on some automatic way to solve the register allocation problem, as I see no improvement in new toolkits, and some of my kernels are completely broken by this.
less_registers_is_NOT_better.pdf (175 KB)