What are the implications of having a potentially huge grid?
I ask because, if a limited number of registers is divided equally among all threads, a large enough grid could exceed that limit, right?
Let’s say, for instance, that I only have 1 MP, which means a total of 8192 available registers. If I launch my kernel with <<<4,256>>>, that is 1024 threads, so each thread has 8 registers available. What if I launch it with <<<40,256>>>, meaning a total of 10240 threads? Since that exceeds the 8192 registers available on my MP, will it start using local memory for what could otherwise be allocated in registers?
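To make my arithmetic concrete, here is a minimal sketch of the premise I am asking about: the register file split evenly across every thread in the grid. The function name is made up for illustration, and the even split over the whole grid is my assumption, not something I know the hardware does.

```cpp
#include <cassert>

// Hypothetical helper: registers each thread would get IF the MP's register
// file were divided evenly across all threads of the grid at once.
// This models the assumption in the question, not the actual scheduler.
int regs_per_thread_if_all_resident(int regs_per_mp, int blocks,
                                    int threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    int total_threads = blocks * threads_per_block;
    return regs_per_mp / total_threads;  // integer division, rounds down
}
```

With 8192 registers, `regs_per_thread_if_all_resident(8192, 4, 256)` gives 8, while `regs_per_thread_if_all_resident(8192, 40, 256)` gives 0, which is why I suspect the whole grid cannot be resident at once under this model.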
So you’re saying that registers are allocated only to the currently active threads, right?
One thing I didn’t catch: you said that 2 blocks would be launched and 2 would wait in the queue. But if the maximum number of active threads is 768 and 3 x 256 = 768, wouldn’t 3 blocks be launched and 1 stay in the queue?
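Here is how I would sketch the count, in case the difference between 2 and 3 comes from register use. I am assuming the number of resident blocks is the smaller of a thread-count limit and a register limit. The helper name and the per-thread register counts are hypothetical, and the sketch ignores shared memory, register allocation granularity, and any hardware cap on blocks per MP.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: blocks that can be resident on one MP at a time,
// taking the smaller of the thread limit and the register limit.
// Assumption: every thread of a resident block holds regs_per_thread
// registers for the block's whole lifetime.
int resident_blocks_per_mp(int max_threads_per_mp, int regs_per_mp,
                           int threads_per_block, int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_threads = max_threads_per_mp / threads_per_block;
    int by_regs = regs_per_mp / (threads_per_block * regs_per_thread);
    return std::min(by_threads, by_regs);
}
```

Under these assumptions, `resident_blocks_per_mp(768, 8192, 256, 10)` gives 3 (the thread limit and the register limit both allow 3), while `resident_blocks_per_mp(768, 8192, 256, 16)` gives 2, because each block then needs 4096 registers and only two fit in 8192.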