Say I have a kernel that uses 27 registers per thread and 48 bytes of static shared memory, and that also allocates some shared memory for each thread in the block (say, N bytes per thread).
How do I calculate the maximum number of threads I can run in a single block?
I do it this way:
#define THREADS_PER_BLOCK_MAX_MEM (int)(((16384 - 48) / N))
#define THREADS_PER_BLOCK_MAX_REG (int)(8192 / 27)
#define THREADS_PER_BLOCK min(512, min( \
    ((THREADS_PER_BLOCK_MAX_MEM / 32) * 32), \
    ((THREADS_PER_BLOCK_MAX_REG / 32) * 32)))
So if shared memory is the limiting factor, the number of threads is computed from the amount available; if registers are the limiting factor, it is computed from the number of available registers; and if the result is greater than 512, it is capped at 512. In every case the number of threads is rounded down to a multiple of 32.
But this does not work. When my calculation gives 320 threads, the kernel actually runs only with 256 threads per block (or reports "too many resources requested for launch").
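One possible explanation, sketched here as a guess rather than a confirmed answer: the hardware may hand out registers in coarser chunks than one thread at a time, so a plain division overestimates what fits. The helper below is hypothetical (the names `round_up` and `max_threads_per_block` are not from any library), and the two granularity parameters `warp_gran` and `reg_unit` are assumptions that would need to be checked against the occupancy calculator or the programming guide for the actual device.

```c
/* Round x up to the next multiple of m (m > 0). */
static int round_up(int x, int m)
{
    return ((x + m - 1) / m) * m;
}

/*
 * Largest block size (a multiple of 32, at most max_threads) whose
 * register and shared memory needs both fit, or 0 if none fits.
 *
 * ASSUMPTION: registers are allocated per block, with the warp count
 * rounded up to a multiple of warp_gran and the register total rounded
 * up to a multiple of reg_unit. Both values are device dependent and
 * are illustrative parameters here, not documented constants.
 */
static int max_threads_per_block(int regs_per_thread,
                                 int static_smem, int smem_per_thread,
                                 int total_regs, int total_smem,
                                 int max_threads,
                                 int warp_gran, int reg_unit)
{
    int t;
    /* Try candidate block sizes from largest to smallest. */
    for (t = (max_threads / 32) * 32; t >= 32; t -= 32) {
        int warps = round_up(t / 32, warp_gran);
        int regs  = round_up(warps * 32 * regs_per_thread, reg_unit);
        int smem  = static_smem + t * smem_per_thread;
        if (regs <= total_regs && smem <= total_smem)
            return t;
    }
    return 0;
}
```

With the numbers from this post, an illustrative N of 16 bytes, and assumed granularities of 2 warps and 256 registers, `max_threads_per_block(27, 48, 16, 8192, 16384, 512, 2, 256)` returns 256. With both granularities set to 1 it returns 288, which is what the plain division in the macros gives for the register limit.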
How do I calculate this correctly?
Thanks in advance,