Relation between # of blocks and device memory size

I’m using a Quadro FX 5600 on an Ubuntu 8.x Linux machine, and I have problems when I try to launch a kernel with a large number of blocks (say, more than a million blocks with a block size of 256). The kernel uses only a small amount of registers and shared memory per thread.
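Just to show the kind of launch I mean, here is a stripped-down sketch (not my real kernel, only a placeholder that touches one float per thread):

#include <cuda_runtime.h>

// Placeholder kernel: each thread increments one element of a.
__global__ void touch(float *a)
{
    unsigned int block = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int i     = block * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

int main()
{
    // Each grid dimension is limited to 65535 blocks on this hardware,
    // so a grid of more than a million blocks has to be two-dimensional.
    dim3 grid(4096, 256);                          // 1 048 576 blocks
    dim3 block(256);                               // 256 threads per block

    size_t n = (size_t)grid.x * grid.y * block.x;  // ~268 M elements
    float *d_a = 0;
    cudaMalloc((void**)&d_a, n * sizeof(float));   // ~1 GiB for the array

    touch<<<grid, block>>>(d_a);

    cudaFree(d_a);
    return 0;
}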

The following post (http://forums.nvidia.com/lofiversion/index.php?t60268.html) says that the maximum number of blocks is 65535^2, but that the total amount of available device memory can limit the actual maximum number of blocks for a given kernel.

Why can the size of device memory affect the actual number of available blocks?

Which of the following assumptions are wrong?

  1. Threads in all blocks execute the same code ==> no additional instruction cache overhead from increasing the # of blocks.
  2. Register and shared memory usage per block affects how many blocks can run in parallel. ==> register and shared memory usage affect the # of active blocks, but not the maximum # of blocks in a grid (the per-device limits themselves can be queried as shown in the sketch after this list).
  3. Thread-local memory could affect the maximum # of blocks only if the CUDA runtime preallocated local memory for every thread in the grid, but that would waste far too much memory. ==> it seems that local memory also affects only the # of active blocks.
  4. Global and constant memory are not related to the # of blocks.
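For what it’s worth, the hard per-device limits involved here can be queried with the runtime API. A minimal sketch (device 0, printing only the fields relevant to this question):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // device 0

    printf("max grid size        : %d x %d x %d blocks\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("registers per block  : %d\n", prop.regsPerBlock);
    printf("shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("total global memory  : %zu bytes\n", prop.totalGlobalMem);
    return 0;
}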

I think the reference to the device memory limitation was referring to the data that you allocate in device memory. I.e. if each thread works on a single float, then 65536^2 threads would require allocating many gigabytes of floats before the kernel call, and the malloc would fail.

Do you call cudaThreadSynchronize() and check the error status after calling the kernel? What error is returned?
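E.g. something along these lines (a minimal sketch; myKernel, grid, block and d_a stand for whatever you actually launch; needs <stdio.h> and <cuda_runtime.h>):

myKernel<<<grid, block>>>(d_a);

cudaError_t err = cudaGetLastError();      // catches launch configuration errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();         // waits for the kernel and catches execution errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(err));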

I found that my error was caused by something else, unrelated to the things I mentioned above.

By the way, in your comment, what kind of data is the malloc() used for, and why would malloc() be called for each thread?

No, there are no allocations per thread.

Consider a kernel that does:

a[i] = a[i] + 1.0f;

and you run one thread per element (typical CUDA programming practice…). Now, if you ran 65535^2 blocks of 256 threads each, you would have 1 099 478 073 600 floats (about 4 TiB) to allocate for the array a. So not many mallocs, just one impossibly large malloc.
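Spelled out as a small host-side sketch (only to show the arithmetic; the numbers are the point, not the code):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    // 65535^2 blocks of 256 threads, one float per thread:
    unsigned long long nElems = 65535ULL * 65535ULL * 256ULL;    // 1 099 478 073 600 elements
    unsigned long long nBytes = nElems * sizeof(float);          // ~4.4e12 bytes, i.e. ~4 TiB

    printf("array a would need %llu bytes\n", nBytes);

    float *d_a = 0;
    cudaError_t err = cudaMalloc((void**)&d_a, (size_t)nBytes);  // hopeless on a 1.5 GB FX 5600
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    return 0;
}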