how to determine max number of blocks per kernel


I’m curious about the maximum number of blocks on CUDA. Let’s say we have a GeForce 8800 with 16 streaming multiprocessors, 8 ALUs each, 16 KB of shared memory per multiprocessor, and a kernel where each block requires 1060 bytes of shared memory.

As far as I understand, the maximum number of blocks that can run simultaneously is limited by the shared memory requirements of the blocks. So I tried to calculate the maximum number of blocks for kernel execution as:
max number of blocks per multiprocessor: 16 KB / 1060 B → 15
max number of blocks on the device → 15 x 16 = 240
However, this calculation conflicts with the experimental results. The aforementioned kernel achieves really good performance when it’s launched on a 64x64 grid = 4096 blocks, or larger. Finally, for a 512x512 grid of blocks, the kernel crashes.

So I would expect that blocks on a multiprocessor are replaced with new ones as soon as they finish processing. Is that correct, or is some other mechanism used?
Finally, how is the maximum number of blocks that can be run
a) by one multiprocessor
b) by one kernel
correctly determined?

Thanks in advance for helping me to better understand the execution model. :)

  1. deviceQuery will tell you the largest possible grid size (max # of blocks)
  2. the occupancy calculator has all the other answers; download it, it is really useful

Denis, thanks for the tips. Here’re the values that I get from deviceQuery and the occupancy calc:

Device 0: "GeForce 8800 Ultra"
  Major revision number:                         1
  Minor revision number:                         0
  Total amount of global memory:                 804978688 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1512000 kilohertz

The lines with the maximum sizes of each dimension of a block/grid are not really clear to me. So, what’s the maximum number of blocks that I can run in one kernel invocation? Is it 65535^2 or less?

1.) Select a GPU from the list (click):	G80
2.) Enter your resource usage:
Threads Per Block	256
Registers Per Thread	8
Shared Memory Per Block (bytes)	1060
3.) GPU Occupancy Data is displayed here and in the graphs:
Active Threads per Multiprocessor	768
Active Warps per Multiprocessor	24
Active Thread Blocks per Multiprocessor	3
Occupancy of each Multiprocessor	100%
Maximum Simultaneous Blocks per GPU	48

Maximum Thread Blocks Per Multiprocessor	Blocks
Limited by Max Warps / Multiprocessor	3
Limited by Registers / Multiprocessor	4
Limited by Shared Memory / Multiprocessor	10

According to the occupancy calculator, each multiprocessor can run 3 blocks of this kernel simultaneously. Does that mean a new block is loaded for execution on an MP as soon as it has finished processing some block, thus allowing thousands of blocks to be executed in one kernel run?

It would be great if you could help me clarify this very basic source of confusion. It would really help to know how many blocks can be expected to execute correctly under one such kernel. Thanks a lot in advance!

65535^2 is correct; that is the maximum number of blocks that can run, and it is independent of your kernel (register usage, etc.). In your case you can have 3 blocks per multiprocessor at any given time, so you will have 16x3 = 48 blocks in flight at once. Blocks that have finished are indeed replaced by blocks that have not yet run (which is why you cannot have communication between blocks).

Thanks for the clarification! One additional question: our experimental kernel starts crashing, e.g., for 512^2 thread blocks (it finishes the computation but reports a memory access error) and for 1024^2 (it hangs completely), although it works correctly for smaller grid sizes, e.g., 256^2 and slightly larger. According to your post it should scale to 65535^2 blocks, which would be great. Do you have any tip on where to look for the source of the problem?

I think I got it! What you should never overlook is that the total amount of available memory on the device is only 768 MB…

There is also another possible problem preventing you from reaching 65k^2, and that is the 5-second kernel execution limit (the watchdog timer).

Total available memory on the device: I think it is 4 GB. Please correct me if I am wrong.

That would depend very much on the device. It certainly isn’t true for a GTX 8800.

Hello. Sorry for digging an old thread, but I would rather not start another cloned topic.

I’m going to implement quite a big piece of code in CUDA.

A quick look reveals there are 1188 double variables declared in the code. If I truncate them to float, they should occupy no more than 1188 registers, while there are supposed to be 8192 registers per thread block.
Still, the CUDA Occupancy Calculator reads that there is a limit of 0 concurrent threads, bounded by “registers per multiprocessor”. I declared the following:
Compute capability 1.1
6 threads per block
1188 registers per thread
0 or 1 byte of shared memory

Is that correct? Can’t I use a large number of registers per single execution thread?

Using shared memory is somewhat less tempting, as I would need 4752 bytes, or 9504 for double precision. It means that my GTS 250 would become at most a 16-core processor when using shared memory, probably not as quick and generally not very useful.

Also, a second issue:
The calculator says there are 8192 registers per thread block.
An Nsight test shows there are 8192 registers per multiprocessor.
Does this imply I can run only one block per multiprocessor?

1188 registers per thread is far beyond the capability of any CUDA device (and of any CPU I know, either), which can have at most 63 or 124 registers per thread depending on compute capability. Not all variables need to reside in registers at the same time, though (which is also how conventional CPUs handle this problem). Compile the code with nvcc -Xptxas=-v and the compiler will print the number of registers each kernel uses.