Maximal threads per block calculation. Calc based on reg and shared mem usage.

Hi All,

Say I have a kernel that uses 27 registers and 48 bytes of shared memory, and additionally allocates some shared memory for each thread in the block (say, N bytes).

How do I calculate the number of threads that I can run simultaneously in a single block?

I do it in this manner:

#define THREADS_PER_BLOCK_MAX_MEM (int)(((16384 - 48) / N))

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / 27)

#define THREADS_PER_BLOCK min(512, min(\
	((THREADS_PER_BLOCK_MAX_MEM / 32) * 32),\
	((THREADS_PER_BLOCK_MAX_REG / 32) * 32)))

So if shared memory is the limiting factor, the number of threads is computed from its size; if registers are the limiting factor, the number of threads is computed from the number of available registers; and if the resulting number is greater than 512, it is capped at 512. In every case the number of threads is rounded down to a multiple of 32.
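
For reference, the macro arithmetic can also be written as a plain function so that the two limits are visible separately. This is only a sketch of the estimate described in this post, not a correct resource model. The function name naive_threads_per_block and its parameters are made up for illustration; the constants 16384, 8192 and 512 are the ones used in the macros.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the THREADS_PER_BLOCK macros.
//   static_smem     : shared memory used by the kernel itself (48 in the post)
//   smem_per_thread : N, bytes of shared memory allocated per thread
//   regs_per_thread : registers per thread (27 in the post)
int naive_threads_per_block(int static_smem, int smem_per_thread, int regs_per_thread)
{
    assert(smem_per_thread > 0 && regs_per_thread > 0);
    const int max_by_mem = (16384 - static_smem) / smem_per_thread;
    const int max_by_reg = 8192 / regs_per_thread;
    // Round each limit down to a multiple of 32.
    const int mem_rounded = (max_by_mem / 32) * 32;
    const int reg_rounded = (max_by_reg / 32) * 32;
    return std::min(512, std::min(mem_rounded, reg_rounded));
}
```

With 27 registers and N = 48 (an arbitrary value chosen for the example), the shared memory limit is 320 threads and the register limit is 288, so the function returns 288.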

But this does not work. When my calculation gives me 320 threads, the kernel actually runs with only 256 threads per block (or reports “too many resources requested for launch”).

How do I calculate this correctly?

Thanks in advance,

Romant.

  1. Is this 27 taken from your PTX file, or do you just “think” 27 registers are used? The compiler can use temporary registers when optimizing your code.

Simple example:

float a;

a = a + sin(a);

Most people think this code uses only one register, but that is not true. The GPU has no instruction that can add a float value to its sine in a single step, so this code will be compiled to something like

TempReg = sin(a);

a = a + TempReg;

Large expressions can dramatically increase the number of registers.

So the only valid source for this number is the PTX code of your kernel.

  2. What is the alignment of that N-byte block? I see you didn’t account for it in your calculation. Check the CUDA documentation to see how alignment is applied for different block sizes, and apply it in your calculation.

EDIT:

I forgot to mention that this is valid only if you have a single block per multiprocessor, because the multiprocessor’s 8192 registers are shared among all blocks active on it.

so you also need to modify

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / 27)

to

#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / (NumberOfRegisters * NumberOfActiveBlocks))
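
A small sketch of the adjusted register limit, assuming (as described in this post) that the 8192 registers of a multiprocessor are split evenly among the active blocks. The function name and parameters are hypothetical, and the rounding to a multiple of 32 is carried over from the original macros.

```cpp
#include <cassert>

// Hypothetical helper: register-limited threads per block when
// active_blocks blocks share one multiprocessor's 8192 registers.
int threads_per_block_max_reg(int regs_per_thread, int active_blocks)
{
    assert(regs_per_thread > 0 && active_blocks > 0);
    const int raw = 8192 / (regs_per_thread * active_blocks);
    return (raw / 32) * 32; // round down to a multiple of 32
}
```

For 27 registers per thread this gives 288 threads with one active block, 128 with two, and 96 with three.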

Here is what nvcc says:
ptxas info : Used 27 registers, 48+48 bytes smem, 44 bytes cmem[1]

27 registers is the correct number. I’m not sure about the 48+48 bytes … Before your answer I had been using -cubin to examine resource usage; I started using -Xptxas -v only after your comment. Does this 48+48 actually mean that each thread consumes 96 bytes instead of 48?

Here is what cubin says about the kernel:
lmem = 0
smem = 48
reg = 27
So “48+48” is not clear …

The N-byte block is composed of ints, so its size is a multiple of 4, and I assume it should be aligned perfectly in shared memory.

Active blocks - yes, I forgot about them.

I believe 48+48 says something about how much you define yourself and how much the kernel uses in total, because input parameters are stored in shared memory. So you always have to take the bigger of the a+b values as input to the occupancy calculator.

Romant,

You should be careful here.

You should leave this job to the CUDA occupancy calculator. Things are not straightforward. There are some quirks.

For example:

  1. The CUDA occupancy XLS seems to round the amount of shared memory used up to a multiple of 512.
  2. If your block size is only 32, then the number of registers actually used per block is 64*REGS instead of 32*REGS

and so on.
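
The two quirks listed above could be modelled roughly as follows. This is a guess at the allocation rules, not something taken from the calculator itself: rounding shared memory up to a multiple of 512 bytes follows item 1, and rounding the thread count up to a multiple of 64 for register allocation is an assumption that generalizes item 2. All names are hypothetical.

```cpp
#include <cassert>

// Round x up to the next multiple of granularity.
int round_up(int x, int granularity)
{
    assert(x >= 0 && granularity > 0);
    return ((x + granularity - 1) / granularity) * granularity;
}

// Quirk 1: shared memory per block is counted in 512-byte units.
int smem_allocated_per_block(int smem_bytes)
{
    return round_up(smem_bytes, 512);
}

// Quirk 2 (assumed generalization): registers are allocated as if
// the block had a multiple of 64 threads.
int regs_allocated_per_block(int threads, int regs_per_thread)
{
    return round_up(threads, 64) * regs_per_thread;
}
```

Under these assumptions a 32-thread block with 27 registers per thread is charged 64 * 27 = 1728 registers, and a 320-thread block would need 320 * 27 = 8640 registers, which is more than the 8192 available.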

These kinds of quirks can differ between G80, G92 and so on. It all depends. So it is better to avoid doing such math in software.

I agree …

I simply forgot about the calculator. It tells me that the highest occupancy is achieved with 256 threads, and that 256 is the maximum possible for my kernel.

I extracted this from the occupancy calculator some time ago:

// T = min(512, floor(8192/(16*Rpt), 4) * 16)

reg = (reg > 0) ? reg : 1;

uint16_t nThreads = min(512, ((8192/(16*reg))&~3)*16);

I’ve only tested it for some kernels on G80.
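
To see what this formula gives for the kernel discussed in this thread, here it is wrapped in a function (the name max_threads_from_regs is made up). The "& ~3" is presumably the integer equivalent of the spreadsheet-style floor to a multiple of 4 in the comment. It has only been tried on G80, so treat it as a sketch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrapper around the extracted formula:
//   T = min(512, floor(8192 / (16 * reg), 4) * 16)
int max_threads_from_regs(int reg)
{
    assert(reg >= 0);
    reg = (reg > 0) ? reg : 1;                  // avoid division by zero
    const int units = (8192 / (16 * reg)) & ~3; // round down to a multiple of 4
    return std::min(512, units * 16);
}
```

For 27 registers, 8192 / (16 * 27) = 18, which rounds down to 16, and 16 * 16 = 256. That matches the 256 threads reported by the occupancy calculator earlier in the thread. The result is always a multiple of 64.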

Is reg the number of registers per thread taken from the cubin file?

If so, 8192 is the number of available registers and 16 is the number of SMs, I guess.

Correct?