I’m using a Quadro FX 5600 on an Ubuntu 8 Linux machine, and I have problems when I try to launch a kernel with a large number of blocks (say, more than a million blocks with a block size of 256). The kernel uses only a small amount of registers and shared memory per thread.
The following post (http://forums.nvidia.com/lofiversion/index.php?t60268.html) says that the maximum number of blocks is 65535^2, but that the total amount of memory available on the device can limit the actual maximum number of blocks for a given kernel.
Why can the size of device memory affect the actual number of available blocks?
Which of the following assumptions are wrong?
- Threads in all blocks execute the same code ==> no additional instruction-cache overhead as the # of blocks increases.
- Register and shared-memory usage per block affects the number of blocks that can run in parallel. ==> register and shared-memory usage affect the # of active blocks, but not the maximum # of blocks in a grid.
- Thread-local memory could affect the maximum # of blocks only if the CUDA runtime preallocated thread-local memory for all threads in a grid. However, that would waste far too much memory. ==> It seems that thread-local memory also affects only the # of active blocks.
- Global and constant memory are not related to the # of blocks.