Relation between # of blocks and device memory size: questions about blocks and memory

I’m using a Quadro FX 5600 on an Ubuntu 8 Linux machine, and I have problems when I try to launch a kernel with a large number of blocks (say, more than a million blocks, each with 256 threads). The kernel uses only a small number of registers and a small amount of shared memory per thread.

Another post says that the maximum number of blocks is 65535^2, but that the total amount of memory available on the device can limit the actual maximum number of blocks for a given kernel.

Why can the size of device memory affect the actual number of available blocks?

Which of the following assumptions are wrong?

  1. Threads in all blocks execute the same code ==> increasing the # of blocks adds no instruction cache overhead.
  2. Register and shared memory usage per block affects the number of blocks that can run in parallel ==> register and shared memory usage affect the # of active blocks, but not the maximum # of blocks in a grid.
  3. Thread-local memory could affect the maximum # of blocks only if the CUDA runtime preallocated thread-local memory for all threads in a grid. However, that would waste too much memory ==> it seems that thread-local memory also affects only the # of active blocks.
  4. Global and constant memory are not related to the # of blocks.
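One way to look at the limits behind these assumptions is to ask the runtime what the device reports. This is only a minimal sketch: it assumes the card of interest is device 0, and it just prints the grid size limit next to the per-block and total memory figures. It does not show how the runtime handles local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Device index 0 is an assumption; use cudaGetDeviceCount() to enumerate.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Grid size limit: independent of how much memory the kernel touches.
    printf("device:                %s\n", prop.name);
    printf("max grid size:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);

    // Per-block resources: these bound how many blocks are resident at once.
    printf("registers per block:   %d\n", prop.regsPerBlock);
    printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);

    // Device-wide memory: bounds the data you can allocate for the grid.
    printf("total global memory:   %zu bytes\n", prop.totalGlobalMem);
    printf("total constant memory: %zu bytes\n", prop.totalConstMem);
    return 0;
}
```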

I think the reference to the device memory limitation was about the data that you allocate in device memory. For example, if each thread works on a single float, then 65536^2 threads would require allocating about 16 GiB of floats before the kernel call, and the malloc would fail.

Do you call cudaThreadSynchronize() and check the error status after launching the kernel? What error is returned?
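For reference, a check of that kind might look roughly like the sketch below. The kernel add_one, the helper check, and the problem size are all made up for illustration. It uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later toolkits; on an old toolkit you would call cudaThreadSynchronize() in the same place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void add_one(float *a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + 1.0f;
}

// Hypothetical helper: returns true on success, prints the error otherwise.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const size_t n = 1 << 20;          // illustrative size, 4 MiB of floats
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    float *d_a = NULL;
    if (!check(cudaMalloc((void **)&d_a, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!check(cudaMemset(d_a, 0, n * sizeof(float)), "cudaMemset")) return 1;

    add_one<<<blocks, threads>>>(d_a, n);

    // Launch errors (e.g. an invalid grid or block configuration) show up here.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs show up after synchronizing.
    ok = ok && check(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(d_a);
    return ok ? 0 : 1;
}
```

Checking both calls separates a launch that was rejected outright from a kernel that started and then failed.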

I found that my error was caused by something else, unrelated to the things I mentioned above.

By the way, in your comment, what kind of data is the malloc() used for? And why would malloc() need to be called for each thread?

No, no allocations are made per thread.

Consider a kernel that does:

a[i] = a[i] + 1.0f;

and you run one thread per element (typical CUDA programming practice…). Now, if you ran 65535^2 blocks each with 256 threads per block that means you have 1 099 478 073 600 floats (4 TiB) to allocate for the array a. So not many mallocs, just one impossibly large malloc.
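To make the arithmetic concrete, here is a rough sketch that computes the array size for that grid and compares it with what the device reports as free. The helper bytes_needed is a made-up name, and the program assumes the current device is the one you care about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes needed for one float per thread in a
// grid_x * grid_y grid of blocks with threads_per_block threads each.
static unsigned long long bytes_needed(unsigned long long grid_x,
                                       unsigned long long grid_y,
                                       unsigned long long threads_per_block) {
    return grid_x * grid_y * threads_per_block * sizeof(float);
}

int main() {
    // 65535 x 65535 blocks of 256 threads, one float per thread.
    unsigned long long need = bytes_needed(65535, 65535, 256);
    printf("bytes needed for a: %llu\n", need);  // 4397912294400, about 4 TiB

    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device memory: %zu bytes free of %zu total\n", free_bytes, total_bytes);

    if (need > (unsigned long long)free_bytes) {
        printf("the array does not fit, so cudaMalloc would fail before any launch\n");
    }
    return 0;
}
```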