Say I have a kernel that uses 27 registers and 48 bytes of shared memory, and additionally allocates some number of bytes of shared memory for each thread in the block (say, N bytes).
How do I calculate the number of threads that I can run simultaneously in a single block?
My reasoning is this: if shared memory is the limitation, the number of threads is computed from its amount; if registers are the limitation, the number of threads is computed from the number of available registers; and if the resulting number of threads is greater than 512, it is capped at 512 (not to mention that the number of threads is a multiple of 32 in any case).
But this does not work. When my calculations give me 320 threads, the kernel actually runs with only 256 threads per block (or reports "too many resources requested for launch").
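For reference, the estimate described above can be sketched as host-side C++. This is only a sketch under assumptions: the 8192-register figure is the one quoted later in this thread, the 16384 bytes of shared memory is an assumed value for this hardware generation, and maxThreadsPerBlock and its parameter names are made up for illustration. It also ignores any allocation granularity the hardware may apply, so it may overestimate.

```cpp
#include <algorithm>
#include <cassert>

// Naive upper bound on threads per block, following the reasoning above.
// The default limits are assumptions, not values queried from the device.
int maxThreadsPerBlock(int regsPerThread,
                       int staticSmemBytes,
                       int smemBytesPerThread,   // N; 0 means no per-thread shared memory
                       int regsAvailable = 8192,
                       int smemAvailable = 16384,
                       int hardLimit = 512,
                       int warpSize = 32)
{
    assert(regsPerThread > 0);

    // Limit imposed by registers.
    int byRegs = regsAvailable / regsPerThread;

    // Limit imposed by shared memory (only if each thread needs some).
    int bySmem = hardLimit;
    if (smemBytesPerThread > 0)
        bySmem = (smemAvailable - staticSmemBytes) / smemBytesPerThread;

    // Take the tightest limit, never below zero, and cap at the hard limit.
    int threads = std::max(0, std::min(std::min(byRegs, bySmem), hardLimit));

    // Round down to a whole number of warps.
    return (threads / warpSize) * warpSize;
}
```

With these defaults, maxThreadsPerBlock(27, 48, 0) returns 288, since 8192 / 27 is 303 and that rounds down to 9 warps.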
Is this 27 from your PTX file, or do you “think” there are 27 registers used? The compiler can use temporary registers when optimizing your code.
Simple example:
float a;
a = a + sin(a);
Most people think this code uses only one register, but that is not true. The GPU has no instruction that can add a float value to its sine in a single step, so this code will be compiled to something like
TempReg = sin(a);
a = a + TempReg;
Large expressions can dramatically increase the number of registers.
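So the only valid information about register usage comes from the compiler's output for your kernel (for example, what ptxas reports), not from counting variables in the source.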
What is the alignment of that N-byte block? I see you didn’t account for it in your calculation. Check the CUDA documentation to see how alignment applies for different block sizes, and include it in your calculation.
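As a rough sketch of what accounting for alignment could look like: round each thread's N bytes up to the alignment before multiplying by the thread count. The alignment value itself is an assumption that has to come from the CUDA documentation for your data type, and alignUp and smemForBlock are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Round size up to the next multiple of alignment.
std::size_t alignUp(std::size_t size, std::size_t alignment)
{
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Shared memory a block would need if every thread's N-byte chunk
// is padded to the given alignment (an assumed layout, not a rule).
std::size_t smemForBlock(std::size_t staticBytes, std::size_t bytesPerThread,
                         std::size_t threads, std::size_t alignment)
{
    return staticBytes + alignUp(bytesPerThread, alignment) * threads;
}
```

For example, 20 bytes per thread padded to a 16-byte alignment becomes 32 bytes per thread, which changes the shared memory limit noticeably.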
EDIT:
I forgot to mention that this is valid only if you have one block per multiprocessor, because the multiprocessor's 8192 registers are shared among all blocks active on it.
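To illustrate the point about sharing, here is a rough sketch of how many blocks fit on one multiprocessor by register count alone, using the 8192 figure above. It ignores shared memory and any allocation granularity, and the function name blocksByRegisters is hypothetical.

```cpp
#include <cassert>

// Blocks that fit on one multiprocessor, considering registers only.
int blocksByRegisters(int regsPerThread, int threadsPerBlock,
                      int regsPerMultiprocessor = 8192)
{
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    int regsPerBlock = regsPerThread * threadsPerBlock;
    return regsPerMultiprocessor / regsPerBlock;
}
```

With 27 registers per thread, a 256-thread block needs 6912 registers and fits once, while a 320-thread block would need 8640 registers and would not fit at all.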
Here is what nvcc says:
ptxas info : Used 27 registers, 48+48 bytes smem, 44 bytes cmem[1]
27 registers is the correct number. I’m not sure about the 48+48 bytes, though. Before your answer I had been using -cubin to examine resource usage; I only started using -Xptxas -v after your comment. Does this 48+48 actually mean that each thread consumes 96 bytes instead of 48?
Here is what cubin says about the kernel:
lmem = 0
smem = 48
reg = 27
So “48+48” is not clear …
The N-byte block is composed of ints, so its size is a multiple of 4, and I assume that it is aligned perfectly in shared memory.
I believe 48+48 says something about how much you define yourself and how much the kernel uses in total, because input parameters are stored in shared memory. So you always have to take the bigger of the a+b values as input to the Occupancy Calculator.