According to the docs, the number of registers available to a thread is a multiple of 64. They also say that each multiprocessor has 8192 registers available and that the maximum number of threads per multiprocessor is 768.
Let's aim for 100% occupancy, so let's take either 96 threads per block with 8 registers per thread, or 128 threads per block with 10 registers per thread (and a very small amount of shared memory, so that it is not the limiting factor). The first case does give the maximum of 24 active warps, with 8 active blocks and 768 active threads per multiprocessor; the second case gives 24 active warps, 6 active blocks and 768 active threads per multiprocessor.
What I don't understand is why I can't have more registers per thread while maintaining 100% occupancy, when registers come in multiples of 64 anyway. It seems that 8192 / 768 ≈ 10.67, rounded down to 10 registers per thread, is a hard cutoff for 100% occupancy, but then what is the meaning of having registers per thread in multiples of 64?
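For reference, here is a rough sketch of the arithmetic I am using to get these numbers. It is only a simplified model: it uses the 8192-register and 768-thread limits quoted from the docs, it assumes a cap of 8 active blocks per multiprocessor (my assumption, chosen because it matches the 8 active blocks in the first case), and it deliberately ignores shared memory and any register allocation granularity, since that granularity is exactly the part I am unsure about. The function names are just illustrative.

```python
# Simplified occupancy arithmetic; not the official occupancy calculator.
REGS_PER_SM = 8192         # registers per multiprocessor (from the docs)
MAX_THREADS_PER_SM = 768   # maximum threads per multiprocessor (from the docs)
WARP_SIZE = 32             # 768 threads / 32 = 24 warps at most
MAX_BLOCKS_PER_SM = 8      # assumption: cap on active blocks per multiprocessor


def active_blocks(threads_per_block, regs_per_thread):
    """Blocks that fit on one multiprocessor, ignoring shared memory
    and any rounding of the register allocation."""
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    return min(MAX_BLOCKS_PER_SM, by_threads, by_regs)


def occupancy(threads_per_block, regs_per_thread):
    """Active warps divided by the maximum number of warps."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    blocks = active_blocks(threads_per_block, regs_per_thread)
    return blocks * warps_per_block / max_warps


for tpb, regs in [(96, 8), (128, 10), (128, 11)]:
    print(tpb, regs, active_blocks(tpb, regs), occupancy(tpb, regs))
```

Under this model, going from 10 to 11 registers per thread with 128 threads per block drops the block count from 6 to 5, which is the cutoff I am asking about.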