Maximal threads per block calculation: calc based on reg and shared mem usage

Romant · June 26, 2008, 4:59pm

Hi All,

Say I have a kernel that uses 27 registers and 48 bytes of shared memory, and that also allocates some shared memory for each thread in the block (say, N bytes).

How do I calculate the number of threads that I can run simultaneously in a single block?

I do it in this manner:

#define THREADS_PER_BLOCK_MAX_MEM (int)(((16384 - 48) / N))

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / 27)

#define THREADS_PER_BLOCK min(512, min(\
	((THREADS_PER_BLOCK_MAX_MEM / 32) * 32),\
	((THREADS_PER_BLOCK_MAX_REG / 32) * 32)))

So if shared memory is the limit, the number of threads is computed from the amount available; if registers are the limit, it is computed from the number of available registers; and if the result is greater than 512, it is capped at 512. In every case the number of threads is rounded down to a multiple of 32.
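The same estimate can be written as a small function so the inputs are easy to vary. This is only a sketch of the calculation from this post, not a correct resource model; the names `naive_threads_per_block`, `round_down_to_warp` and `min_int` are made up for illustration, and the 16384-byte, 8192-register and 512-thread limits are the ones assumed above.

```c
/* Round x down to a multiple of 32 (the warp size). */
static int round_down_to_warp(int x) { return (x / 32) * 32; }

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Naive estimate from the post (hypothetical helper, illustration only).
 * regs_per_thread : registers used by each thread (must be > 0)
 * static_smem     : shared memory bytes used regardless of block size
 * smem_per_thread : N, shared memory bytes per thread (must be > 0)
 * Assumes 16384 bytes of shared memory, 8192 registers, 512-thread cap.
 */
static int naive_threads_per_block(int regs_per_thread,
                                   int static_smem,
                                   int smem_per_thread)
{
    int by_smem = round_down_to_warp((16384 - static_smem) / smem_per_thread);
    int by_regs = round_down_to_warp(8192 / regs_per_thread);
    return min_int(512, min_int(by_smem, by_regs));
}
```

For instance, with 27 registers, 48 static bytes and a hypothetical N of 48, the shared memory limit comes out at 320 and the register limit at 288, so the function returns 288.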

And this does not work. When my calculation gives me 320 threads, the kernel actually runs only with 256 threads per block (or reports “too many resources requested for launch”).

How do I calculate this correctly?

Thanks in advance,

Romant.

mandrak · June 26, 2008, 11:42pm

Is this 27 from your PTX file, or do you just “think” that 27 registers are used? The compiler can use temporary registers when optimizing your code.

Simple example:

float a;

a = a + sin(a);

Most people think this code uses only one register, but that is not true. The GPU has no single instruction that adds a float value to its sine, so this code will be compiled to something like

TempReg = sin(a);

a = a + TempReg;

Large expressions can dramatically increase the number of registers.

So the only valid information about register usage comes from the PTX code of your kernel.

What is the alignment of that N-byte block? I see you didn’t account for it in your calculation. Check the CUDA documentation to see how alignment is applied for different block sizes, and apply it in your calculation.

EDIT:

I forgot to mention that this is valid only if you have just one block per multiprocessor, because each multiprocessor has 8192 registers shared among the blocks active on it.

So you also need to change

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / 27)

to

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / (NumberOfRegisters * NumberOfActiveBlocks))
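To see how quickly the register limit shrinks as more blocks share a multiprocessor, the modified macro can be sketched as a function. The name `reg_limited_threads` is hypothetical, the 8192-register budget is the figure used in this thread, and the rounding down to a multiple of 32 is carried over from the original macros. It is plain division only and ignores any allocation granularity the hardware may apply.

```c
/*
 * Register-limited threads per block when several blocks are resident
 * on one multiprocessor (hypothetical helper, illustration only).
 * Assumes 8192 registers per multiprocessor; both arguments must be > 0.
 */
static int reg_limited_threads(int regs_per_thread, int active_blocks)
{
    int threads = 8192 / (regs_per_thread * active_blocks);
    return (threads / 32) * 32;  /* round down to a whole number of warps */
}
```

With 27 registers per thread this gives 288 threads for one active block, 128 for two, and 96 for three.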

Romant · June 27, 2008, 7:56am

Here is what nvcc says:
ptxas info : Used 27 registers, 48+48 bytes smem, 44 bytes cmem[1]

27 registers is the correct number. I’m not sure about the 48+48 bytes … Before your answer I had been using -cubin to examine resource usage; I started using -Xptxas -v only after your comment. Does this 48+48 actually mean that each thread consumes 96 bytes instead of 48?

Here is what cubin says about the kernel:
lmem = 0
smem = 48
reg = 27
So “48+48” is not clear …

The N-byte block is composed of ints, so its size is a multiple of 4, and I assume that it is aligned properly in shared memory.

Active blocks - yes, I forgot about them.

E.D_Riedijk · June 27, 2008, 8:09am

I believe 48+48 says something about how much you define yourself and how much the kernel uses in total, because input parameters are stored in shared memory. So you always have to take the bigger of the a+b values as input to the occupancy calculator.

Sarnath · June 27, 2008, 8:30am

Romant,

You should be careful here.

You should leave this job to the CUDA occupancy calculator. Things are not straightforward. There are some quirks.

For example:

The CUDA occupancy XLS seems to round the amount of shared memory used to a multiple of 512.
If your block size is only 32, then the number of registers actually used per block is 64 * REGS instead of 32 * REGS

and so on.

These kinds of quirks could differ between G80 and G92 and so on… It all depends. So it is better to avoid doing such math in software.
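The shared memory rounding mentioned above can change the answer, as this rough sketch shows. It assumes the rounding is upward to the next multiple of 512 bytes and uses the 16384-byte budget from earlier in the thread; both helper names are hypothetical, and the result ignores every other limit (registers, maximum blocks per multiprocessor).

```c
/* Round value up to the next multiple of granularity (granularity > 0). */
static int round_up_to(int value, int granularity)
{
    return ((value + granularity - 1) / granularity) * granularity;
}

/*
 * Blocks that fit on one multiprocessor by shared memory alone
 * (hypothetical helper). Assumes 16384 bytes per multiprocessor and
 * allocation in 512-byte units; smem_per_block must be > 0.
 */
static int blocks_per_sm_by_smem(int smem_per_block)
{
    int allocated = round_up_to(smem_per_block, 512);
    return 16384 / allocated;
}
```

A block that asks for 5400 bytes would be charged 5632 bytes under this assumption, so only 2 blocks fit, whereas the unrounded division 16384 / 5400 suggests 3.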

Romant · June 27, 2008, 8:47am

I agree …

I simply forgot about the calculator. It tells me that the highest occupancy is achieved with 256 threads, and that 256 is the maximum possible for my kernel.

jcornwall · June 30, 2008, 2:44pm

I extracted this from the occupancy calculator some time ago:

// T = min(512, floor(8192/(16*Rpt), 4) * 16)

reg = (reg > 0) ? reg : 1;

uint16_t nThreads = min(512, ((8192/(16*reg))&~3)*16);

I’ve only tested it for some kernels on G80.
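As a worked check of the formula above for the 27-register kernel from this thread: 8192 / (16 * 27) = 18 in integer division, clearing the low two bits rounds 18 down to 16, and 16 * 16 = 256, which matches the 256 threads reported earlier. The sketch below wraps the same expression in a function; the name `max_threads_g80` is made up here, and the result is only as reliable as the formula itself.

```c
#include <stdint.h>

/*
 * Formula quoted above, wrapped in a function (hypothetical name).
 * reg is the number of registers per thread; 0 is treated as 1.
 */
static uint16_t max_threads_g80(unsigned reg)
{
    reg = (reg > 0) ? reg : 1;
    /* 8192 / (16 * reg), rounded down to a multiple of 4 */
    unsigned units = (8192u / (16u * reg)) & ~3u;
    unsigned threads = units * 16u;
    /* cap at 512 threads per block */
    return (uint16_t)(threads < 512u ? threads : 512u);
}
```

Other inputs behave the same way: 20 registers gives 384 threads, 64 registers gives 128, and anything at or below 16 registers hits the 512 cap.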

Romant · June 30, 2008, 3:00pm

Is reg the number of registers per thread taken from the cubin file?

If so, I guess 8192 is the number of available registers and 16 is the number of SMs.

Correct?
