Grid size performance implications

What are the implications of having a potentially huge grid?

I ask because, if there is a limited number of registers that gets divided equally among all threads, a huge grid could exceed that limit, right?

Let’s say, for instance, that I only have 1 MP, which means a total of 8192 available registers. If I launch my kernel with <<<4,256>>>, each thread gets 8 registers. What if I launch it with <<<40,256>>>, meaning a total of 10240 threads? Since that exceeds the 8192 registers available on my MP, will it start using local memory for what could otherwise be kept in registers?
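For reference, here is roughly the scenario I have in mind (just a sketch; dummyKernel is a placeholder, and the register count is whatever the compiler actually assigns, which you can also see with nvcc --ptxas-options=-v):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // trivial work, just so something gets compiled
}

int main()
{
    // cudaFuncGetAttributes reports the per-thread register count the compiler chose.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);
    printf("registers per thread: %d\n", attr.numRegs);

    float *d;
    cudaMalloc(&d, 40 * 256 * sizeof(float));

    dummyKernel<<<4, 256>>>(d);    // 1024 threads total
    dummyKernel<<<40, 256>>>(d);   // 10240 threads total
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}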

No. Suppose the maximum number of active threads per SM (multiprocessor) is 768. Then, with your configuration:

256 threads per block and 2K registers per block, so you have two blocks per SM.

Assumption: the GPU has only 1 SM. Then:

Case 1: the execution configuration is <<<4,256>>>, so you have 4 blocks but only 1 SM.

Only 2 blocks will be dispatched to that SM; the remaining 2 blocks wait in the queue.

After the SM finishes one block, the scheduler assigns the next queued block to it.

Hence the size of the grid is not a problem.
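If it helps, here is a tiny sketch showing that the grid simply drains: every block eventually runs no matter how large the grid is, blocks that cannot become resident just wait their turn (countBlocks and the counter are only illustrative names):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void countBlocks(int *counter)
{
    // One thread per block bumps the counter; blocks beyond the resident
    // limit sit in the queue until an SM frees up, then run normally.
    if (threadIdx.x == 0)
        atomicAdd(counter, 1);
}

int main()
{
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    countBlocks<<<4096, 256>>>(counter);   // far more blocks than fit at once
    cudaDeviceSynchronize();

    printf("blocks executed: %d\n", *counter);   // prints 4096
    cudaFree(counter);
    return 0;
}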

So you’re saying that register allocation is based on the currently active threads, right?

One thing I didn’t catch: you said that 2 blocks would be launched and 2 would be in the queue. But if the maximum number of active threads is 768 and 3 x 256 = 768, wouldn’t 3 blocks be launched and 1 stay in the queue?

Sorry, I made a mistake.

1 SM has 3 active blocks; each block has 2K registers and 256 threads.

The SM would allocate 3 x 2K = 6K registers for the 3 active blocks (the per-thread register count itself is fixed by nvcc at compile time).
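To make that accounting concrete, a rough sketch (the 256 threads/block and 8 registers/thread figures are just the assumptions from this thread; it ignores shared memory and block-slot limits):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 256;
    const int regsPerThread   = 8;                               // assumed, from ptxas -v
    const int regsPerBlock    = regsPerThread * threadsPerBlock; // 2K

    int limitByThreads = prop.maxThreadsPerMultiProcessor / threadsPerBlock;
    int limitByRegs    = prop.regsPerMultiprocessor / regsPerBlock;
    int activeBlocks   = limitByThreads < limitByRegs ? limitByThreads : limitByRegs;

    // With the numbers in this thread (768 threads, 8192 registers per SM)
    // this works out to min(3, 4) = 3 active blocks, i.e. 3 x 2K = 6K registers in use.
    printf("active blocks per SM: %d\n", activeBlocks);
    printf("registers in use per SM: %d of %d\n",
           activeBlocks * regsPerBlock, prop.regsPerMultiprocessor);
    return 0;
}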

Now the SM has 768 active threads (belonging to the 3 active blocks), and those 768 threads are grouped into 24 warps (32 threads per warp). The 8 SPs (stream processors) of the SM execute one warp per instruction.

When a warp stalls waiting on memory, the scheduler picks another warp from the remaining 23 to run on the 8 SPs. The GPU does not need a context switch to do this, because each warp's register set is mutually exclusive; this is different from a context switch between threads on a CPU.

That is why threads on a GPU are so lightweight.
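The warp arithmetic from above in one place (these are just the numbers used in this thread for an old 8-SP-per-SM part; newer GPUs have different limits, so treat them as placeholders):

#include <cstdio>

int main()
{
    const int activeThreadsPerSM = 768;   // assumed SM thread limit from this example
    const int warpSize           = 32;
    const int regsPerThread      = 8;     // assumed per-thread register count

    int warpsPerSM  = activeThreadsPerSM / warpSize;   // 24 resident warps
    int regsPerWarp = warpSize * regsPerThread;        // 256 registers held per warp

    printf("resident warps per SM: %d\n", warpsPerSM);
    printf("registers held per warp: %d\n", regsPerWarp);
    // Because every resident warp keeps its own slice of the register file,
    // the scheduler can switch between warps without saving or restoring state.
    return 0;
}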