The CUDA documentation says:
2.2.2 Grid of Thread Blocks
There is a limited maximum number of threads that a block can contain. However, blocks of same dimensionality and size that execute the same kernel can be batched together into a grid of blocks, so that the total number of threads that can be launched in a single kernel invocation is much larger.
The maximum number of threads per block is 512
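To make the quoted passage concrete, here is a minimal sketch of how I understand a multi-block launch. The kernel name `fillIndex`, the helper `blocksFor`, the element count and the block size of 256 are my own assumptions for illustration; none of them come from the documentation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index into out.
__global__ void fillIndex(int *out, int n)
{
    // Global index = block offset + position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // the last block may contain surplus threads
        out[i] = i;
}

// Hypothetical helper: number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 100000;             // far more elements than one block can cover
    const int threadsPerBlock = 256;  // assumed block size, below the quoted limit of 512
    const int blocks = blocksFor(n, threadsPerBlock);

    int *dOut = NULL;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // One kernel invocation launches blocks * threadsPerBlock threads in total.
    fillIndex<<<blocks, threadsPerBlock>>>(dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dOut);
        return 1;
    }

    // Copy back the last element to check that the final block also ran.
    int last = -1;
    cudaMemcpy(&last, dOut + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("blocks=%d threads launched=%d last element=%d\n",
           blocks, blocks * threadsPerBlock, last);

    cudaFree(dOut);
    return 0;
}
```

With these assumed numbers the launch uses 391 blocks of 256 threads each. The sketch only shows how many threads are launched; it says nothing about how many of those blocks are resident on the hardware at the same moment.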
I understand that many threads can be “launched”, but I wonder how many threads or kernels are actually executed together.
Is it possible for more than one thread block to be executed at the same time? Is this only the case when there are fewer than 512 threads per block?
Is there more detailed information on how the threads are distributed over the multiprocessors?