CUDA Occupancy Calculator helps pick an optimal thread block size

I’m also interested in this, since the number of registers per thread is currently my only bottleneck for maximizing occupancy, so any best-practice guidelines for reducing register usage would be greatly appreciated.
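
Not a complete answer, but two knobs worth knowing about while you restructure the kernel itself: the nvcc flag -maxrregcount, which caps register usage for the whole compilation unit (anything above the cap spills to local memory), and the __launch_bounds__ qualifier in newer toolkits, which lets ptxas pick a register budget per kernel. A minimal sketch, where the kernel name, body and the actual limits are just placeholders:

// Build-time cap for every kernel in the file:
//   nvcc --ptxas-options=-v --maxrregcount=16 mykernel.cu

// Per-kernel hint: at most 256 threads per block, and ask ptxas to keep the
// register count low enough that at least 2 blocks fit on a multiprocessor.
__global__ void __launch_bounds__(256, 2)
scale_kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder body
}

Either way it is worth rechecking the ptxas -v output afterwards, since a hard cap can trade registers for local memory spills (reported as lmem).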

Note: The spammer on this thread has been banned to infinity and beyond.

Hi Mark, I have a query. When I compile my .cu file, the cubin info for that kernel appears in the generated .obj file, and it matches what the -cubin command line option reports, namely that I use 448 bytes of shared memory and 12 registers. But when I compile using “$(CUDA_BIN_PATH)\nvcc.exe” --ptxas-options=-v -ccbin …, the Visual Studio output shows me this:

1>ptxas info : Used 12 registers, 448+444 bytes smem, 8 bytes cmem[1]

So which shared memory value should I take into account when using the CUDA profiler to make a good choice about an optimal thread block size: 448+444, or 448?

Best regards
Lermy

Input data:

Compute Capability: 1.3 (GTX 260)
Threads per block: 416
Registers per thread: 19
Shared memory per block: 5268
no dynamic shared memory

The Calculator shows 81% occupancy (2 blocks per multiprocessor),
however the profiler gives 40.6% (1 block only).

Why the difference?

With 416 threads, you should use fewer than 19 registers to get 81% occupancy. This configuration is very, very familiar to me (so I don’t need the Occupancy Calculator to answer :-) ). What are you doing there?

Ah… I was using an old version of the Calculator, 1.4, while there is already a 1.5. Now I see that I must use only 384 threads with that many registers.

With some unrelated changes, my register usage dropped below 16 though…

Computing SAH costs for a kd-tree.
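
For anyone else who hits the same calculator-vs-profiler discrepancy: below is roughly the per-block arithmetic I believe the 1.5 Calculator does for compute capability 1.3. The granularities (warps rounded up to pairs, registers and shared memory allocated in 512-unit chunks) are my reading of the spreadsheet, so treat this as a sketch rather than gospel.

#include <stdio.h>

static int round_up(int x, int unit) { return ((x + unit - 1) / unit) * unit; }

int main(void)
{
    /* the configuration from the question above */
    const int threads = 416, regs_per_thread = 19, smem_per_block = 5268;
    /* assumed GTX 260 (sm_13) limits per multiprocessor */
    const int regs_per_sm = 16384, smem_per_sm = 16384, warps_per_sm = 32;

    int warps       = round_up(threads, 32) / 32;                        /* 13   */
    int alloc_warps = round_up(warps, 2);                                /* 14: warp granularity of 2 on 1.2/1.3 */
    int regs_block  = round_up(alloc_warps * 32 * regs_per_thread, 512); /* 8704 */
    int smem_block  = round_up(smem_per_block, 512);                     /* 5632 */

    int by_regs  = regs_per_sm / regs_block;   /* 1  <- the limiter */
    int by_smem  = smem_per_sm / smem_block;   /* 2 */
    int by_warps = warps_per_sm / warps;       /* 2 */

    int blocks = by_regs;
    if (by_smem  < blocks) blocks = by_smem;
    if (by_warps < blocks) blocks = by_warps;

    printf("%d block(s), occupancy = %d%%\n",
           blocks, 100 * blocks * warps / warps_per_sm);   /* 1 block, 40% */
    return 0;
}

Registers are the limiter: only one 416-thread block fits, i.e. 13 of 32 warps, which is the profiler’s 40.6%. Dropping to 384 threads (12 warps, 7680 allocated registers per block) or below 19 registers lets a second block fit.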

I would also like to know how to properly handle the plus sign in the verbose ptxas output.

Hi!

Why do you calculate the shared memory demand with

(round_up(SharedMemoryPerBlock/512)*512) ?

Thanks in advance!
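
Not the author, but my understanding is that the rounding models the shared memory allocation granularity: on 1.x hardware a block’s shared memory is reserved in 512-byte chunks (a value I am taking from the Calculator’s device data), so the demand is padded up to the next multiple of 512 before the 16 KB per multiprocessor is divided up. A tiny illustration:

#include <stdio.h>

int main(void)
{
    int requested = 5268;                             /* bytes the kernel declares    */
    int padded    = ((requested + 511) / 512) * 512;  /* 5632 bytes actually reserved */
    printf("blocks limited by smem: %d\n", 16384 / padded);   /* prints 2 */
    return 0;
}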

I know this is an old question, but it might be useful to give an answer anyway (for others who are interested).

Some guidelines and case-studies on register optimization have been discussed in a recently published paper at the ODES workshop. You can find the paper here:

http://www.imec.be/odes/odes-8_proceedings.pdf (starting at the 7th page of the document)

Looks very useful. Will the CUDAVis tool be made available?

I am voting for CUDAVis too. Btw, even some versions of C for general-purpose processors had the register keyword to place a variable in registers, but CUDA C does not have it… With Fermi this should not be a very big issue, but on GT200 people sometimes really struggle.

Almost all modern C/C++ compilers ignore the register keyword.

Yep, I remember some DOS compilers honoring it. But in CUDA C it could be really useful.

His website mentions that he still has to upload it; there is already a page where the download should appear.

Ah, thanks. I found it here: http://www.es.ele.tue.nl/~gpuattue/?nav=tools

Does anyone know how to obtain the Ruby tool?

The tool is indeed available as stated.

However, usability might not be ideal, since you have to install some extensions to the Ruby programming language (readme: http://www.es.ele.tue.nl/~gpuattue/downloa…_0.1-readme.txt ).

Users of OS X will have Ruby pre-installed, and Linux users should not have too much trouble either. However, I do not have any experience with Windows.

Hello Cedric!
I hope you’re open to critique.
During my tests, the compiler with the maxrregcount option specified wins over your optimizations in all cases, including the ones published in your work. Both microbenchmarks and full-scale experiments show that your optimizations achieve worse results than the compiler’s.
I also see that the compiler’s register allocation is buggy in some cases, but the way you try to fix it needs a few more rules than simply “fewer registers is better”.
I’m preparing a publication with my benchmarks. It will be available soon.

Of course I am! However, the point of my work is to show that the compiler lacks a good automatic register allocation technique. The optimizations in my work are there for illustration, not so much to compete with the compiler (using maxrregcount).

Also, let me know when your work is finished. Are you planning on working on an auto-tuner of some sort?

My article, as promised.
Actually yes, I do plan to work on some automatic way to solve the register allocation problem, as I see no improvement in new toolkits, and some of my kernels are completely broken by this.
less_registers_is_NOT_better.pdf (175 KB)