Registers per thread limit and occupancy

According to the docs, the number of registers available to a thread is a multiple of 64. They also say that there are 8192 registers available per multiprocessor and that the maximum number of threads per multiprocessor is 768.

Let’s aim for 100% occupancy, so let’s take 96 threads per block and 8 registers per thread, or 128 threads per block and 10 registers per thread (with a very small amount of shared memory, so that it is not the limiting factor). The first case gives the maximum of 24 active warps, with 8 active blocks and 768 active threads per multiprocessor; the second also gives 24 active warps, with 6 active blocks and 768 active threads.
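As a rough check of the two configurations above, here is a minimal Python sketch of the occupancy arithmetic. It assumes the limits quoted in this post (8192 registers and 768 threads per multiprocessor), plus a warp size of 32 and a cap of 8 resident blocks per multiprocessor. The cap, the constant names and the function name `occupancy` are assumptions made for illustration, and the model ignores shared memory and any register allocation granularity.

```python
REGS_PER_SM = 8192         # registers per multiprocessor (from the post)
MAX_THREADS_PER_SM = 768   # threads per multiprocessor (from the post)
WARP_SIZE = 32             # assumption: 32 threads per warp
MAX_BLOCKS_PER_SM = 8      # assumption: at most 8 resident blocks


def occupancy(threads_per_block, regs_per_thread):
    """Return (active blocks, active warps, occupancy fraction)."""
    # Blocks that fit under the thread limit.
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    # Blocks that fit under the register limit (simplified model).
    by_regs = REGS_PER_SM // (threads_per_block * regs_per_thread)
    blocks = min(by_threads, by_regs, MAX_BLOCKS_PER_SM)
    # Warps per block, rounded up to whole warps.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    warps = blocks * warps_per_block
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    return blocks, warps, warps / max_warps


if __name__ == "__main__":
    for tpb, regs in [(96, 8), (128, 10), (128, 11)]:
        blocks, warps, frac = occupancy(tpb, regs)
        print(f"{tpb} threads/block, {regs} regs/thread: "
              f"{blocks} blocks, {warps} warps, {frac:.0%} occupancy")
```

Under these assumptions, both configurations from the post reach 24 active warps, while going to 11 registers per thread with 128 threads per block drops the block count to 5 and the warp count to 20.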

What I don’t understand is why I can’t have more registers per thread while maintaining 100% occupancy, when registers come in multiples of 64 anyway. It seems 8192 / 768 = 10 (rounded down) registers per thread is a hard cutoff for 100% occupancy, but then what is the meaning of having registers per thread in multiples of 64?

This is a little confusing in the programming guide (fixed in the next version), thanks for pointing it out. It’s not that registers are allocated in multiples of 64…

Here’s the new info that will be in the programming guide:

Several blocks can be processed by the same multiprocessor concurrently by allocating the multiprocessor’s registers and shared memory among the blocks. More precisely, the number of registers available per thread is equal to:

N_registersPerMultiprocessor / CEIL(N_concurrentBlocks*N_threadsPerBlock, 64)

where N_registersPerMultiprocessor is the total number of registers per multiprocessor, N_concurrentBlocks is the number of concurrent blocks, N_threadsPerBlock is the number of threads per block, and CEIL(X, 64) means rounded up to the nearest multiple of 64.

(So the 64 is not referring to registers, but to threads)

Mark
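The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `registers_per_thread` and its parameter names are made up here, and flooring the result to a whole number of registers is an assumption (the formula as quoted gives a plain ratio).

```python
def registers_per_thread(regs_per_multiprocessor, concurrent_blocks,
                         threads_per_block, granularity=64):
    """Registers available per thread, following the formula quoted above."""
    total_threads = concurrent_blocks * threads_per_block
    # CEIL(X, 64): round the thread count up to the nearest multiple of 64.
    rounded_threads = -(-total_threads // granularity) * granularity
    # Assumption: a thread can only use a whole number of registers.
    return regs_per_multiprocessor // rounded_threads


if __name__ == "__main__":
    # 6 blocks of 128 threads = 768 threads: the 100% occupancy case.
    print(registers_per_thread(8192, 6, 128))
    # 3 blocks of 128 threads = 384 threads: half the threads, more registers.
    print(registers_per_thread(8192, 3, 128))
    # 1 block of 100 threads is rounded up to 128 threads.
    print(registers_per_thread(8192, 1, 100))
```

With 768 concurrent threads the result is 10, which matches the limit discussed in this thread; halving the number of concurrent threads to 384 raises it to 21.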

Thanks Mark, makes more sense now.

Then the conclusion is that 100% occupancy implies a limit of 10 registers per thread. Good to know :)

Yes. Another way to look at it is that if you are limited to less than 100% occupancy by shared memory usage or thread count (greater than 384 threads per block), then you can use extra registers “for free”. And vice versa: if you are limited by thread count or register usage, you can use more shared memory per block. If you are limited by either shared memory or register usage, then you can possibly use more threads per block to increase occupancy.

Mark
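To make the “for free” point concrete, here is a small Python sketch for the case where thread count is the limit. It reuses the figures from this thread (8192 registers, 768 threads per multiprocessor, rounding to 64 threads); the cap of 8 resident blocks and the name `register_budget` are assumptions for illustration, and shared memory is ignored.

```python
REGS_PER_SM = 8192         # registers per multiprocessor (from the thread)
MAX_THREADS_PER_SM = 768   # threads per multiprocessor (from the thread)
MAX_BLOCKS_PER_SM = 8      # assumption: at most 8 resident blocks
GRANULARITY = 64           # thread count is rounded up to a multiple of 64


def register_budget(threads_per_block):
    """Return (blocks, occupancy fraction, registers per thread) when only
    the thread count limits how many blocks are resident."""
    blocks = min(MAX_THREADS_PER_SM // threads_per_block, MAX_BLOCKS_PER_SM)
    resident_threads = blocks * threads_per_block
    occupancy = resident_threads / MAX_THREADS_PER_SM
    rounded = -(-resident_threads // GRANULARITY) * GRANULARITY
    # Assumption: floor to a whole number of registers per thread.
    return blocks, occupancy, REGS_PER_SM // rounded


if __name__ == "__main__":
    for tpb in (128, 384, 512):
        blocks, occ, regs = register_budget(tpb)
        print(f"{tpb} threads/block: {blocks} block(s), "
              f"{occ:.0%} occupancy, up to {regs} registers/thread")
```

In this model a 512-thread block already caps occupancy at about 67% because only one block fits, so each thread can use up to 16 registers without lowering occupancy any further, whereas 128 or 384 threads per block stay at 100% with a budget of 10.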