How to determine the max number of blocks and threads for a GPU?

Is there a way to determine the max number of grids, blocks per grid, and threads per block, on a given GPU?

For instance, the Tesla C2075 board, as specified here, has 448 cores.

What does this mean for the max number of blocks n and threads m that I can use when launching the kernel?


I found the information in table 14.

But I still don't understand how many blocks I can create.

My GPU is Tesla C2050/2075 and when I run the command deviceQuery I see:

Device 0: “Tesla C2075”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5301 MBytes (5558501376 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Max Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla C2075
Result = PASS

Table 14, which you mentioned, specifies the following for compute capability 2.0:

Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8

Does this mean that those values give the largest number of blocks and threads that can run my function in parallel?


No, it does not. GPU kernel launches can consist of many more blocks than just those that can be resident on a multiprocessor.

The most immediate limits are these:

Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)

The kernel launch configuration, <<<…>>>, specifies two dim3 quantities: the first is the number of blocks in the grid, and the second is the number of threads in each block.

These are 3-dimensional quantities. Each dimension must satisfy its respective limit. Furthermore, the total number of threads in the block (i.e. the product of the 3 dimensions of the threadblock size) must not exceed 1024.

Anything within those limits is possible.
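
As a rough illustration of those rules, here is a minimal host-only C++ sketch that checks a launch configuration against the limits printed by deviceQuery above. `Dim3`, the `kMax...` constants and `isValidLaunch` are hypothetical names made up for this example; they are not part of the CUDA API. In a real program the limits would normally be read at run time with `cudaGetDeviceProperties` (fields `maxThreadsPerBlock`, `maxThreadsDim`, `maxGridSize`) rather than hard-coded.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 so this compiles without the CUDA toolkit.
struct Dim3 {
    std::uint64_t x, y, z;
};

// Limits copied from the deviceQuery output for the Tesla C2075 (assumed fixed here).
constexpr Dim3 kMaxBlockDim{1024, 1024, 64};
constexpr Dim3 kMaxGridDim{65535, 65535, 65535};
constexpr std::uint64_t kMaxThreadsPerBlock = 1024;

// True if every dimension is at least 1 and does not exceed its per-dimension limit.
constexpr bool fitsWithin(const Dim3& d, const Dim3& limit) {
    return d.x >= 1 && d.y >= 1 && d.z >= 1 &&
           d.x <= limit.x && d.y <= limit.y && d.z <= limit.z;
}

// True if the grid and block respect the per-dimension limits and the block
// holds at most kMaxThreadsPerBlock threads in total (x * y * z).
constexpr bool isValidLaunch(const Dim3& grid, const Dim3& block) {
    return fitsWithin(grid, kMaxGridDim) &&
           fitsWithin(block, kMaxBlockDim) &&
           block.x * block.y * block.z <= kMaxThreadsPerBlock;
}
```

For example, `isValidLaunch(Dim3{65535, 1, 1}, Dim3{32, 32, 1})` is true (1024 threads per block), while `isValidLaunch(Dim3{1, 1, 1}, Dim3{32, 32, 2})` is false: each block dimension is within its own limit, but the product is 2048 threads.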

No. The number of blocks in a kernel launch is limited by the maximum dimensions of the grid. The first parameter between the triple angular brackets is the grid configuration (1D, 2D, or 3D): blocks in the grid. The second parameter is the block configuration (1D, 2D, or 3D): threads in each block.

So in this case, you can have a 1D grid of up to 65535 blocks, a 2D grid of up to 65535 x 65535 blocks, or a 3D grid of up to 65535 x 65535 x 65535 blocks.
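
To make the grid limit concrete, the sketch below works out how many blocks a 1D launch needs to cover n elements with one thread per element, and whether that count fits in a 1D grid on this device. It is plain host C++ under stated assumptions: the 65535 limit is taken from the deviceQuery output above, and the function names are hypothetical helpers, not CUDA API calls.

```cpp
#include <cstdint>

// Max grid size in x, taken from the deviceQuery output above.
constexpr std::uint64_t kMaxGridDimX = 65535;

// Ceiling division: smallest block count such that blocks * threadsPerBlock >= n.
// Assumes threadsPerBlock is at least 1.
constexpr std::uint64_t blocksNeeded(std::uint64_t n, std::uint64_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a 1D grid can cover n elements with one thread per element.
constexpr bool fitsIn1DGrid(std::uint64_t n, std::uint64_t threadsPerBlock) {
    return blocksNeeded(n, threadsPerBlock) <= kMaxGridDimX;
}

// Largest element count a 1D grid can cover with one thread per element.
constexpr std::uint64_t maxElements1D(std::uint64_t threadsPerBlock) {
    return kMaxGridDimX * threadsPerBlock;
}
```

With 256 threads per block, one million elements need `blocksNeeded(1000000, 256)` = 3907 blocks, far more than the 112 that can be resident at once and still well inside the grid limit. With 1024 threads per block, a 1D grid tops out at 65535 x 1024 = 67107840 elements; beyond that, a 2D or 3D grid (or a kernel in which each thread handles several elements) would be needed.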

In typical CUDA programs the number of blocks in a grid is significantly larger than the number of blocks that can execute simultaneously at any given time, which is max_resident_blocks_per_multiprocessor x number_of_multiprocessors (in this case: 8 x 14 = 112 blocks).
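
The 8 x 14 = 112 figure can be reproduced with a few lines of host-only C++. This sketch also applies the "Maximum number of threads per multiprocessor: 1536" line from the deviceQuery output, which lowers the resident-block count for large blocks. It is a simplified model under stated assumptions: it ignores register and shared-memory usage, which can reduce residency further, and all names are hypothetical helpers rather than CUDA API calls.

```cpp
#include <cstdint>

// Values taken from the deviceQuery output and table 14 quoted above.
constexpr std::uint64_t kNumMultiprocessors = 14;
constexpr std::uint64_t kMaxResidentBlocksPerSM = 8;
constexpr std::uint64_t kMaxResidentThreadsPerSM = 1536;

constexpr std::uint64_t minU64(std::uint64_t a, std::uint64_t b) {
    return a < b ? a : b;
}

// Blocks that can be resident on one multiprocessor, limited by both the block
// count cap and the thread count cap. Assumes 1 <= threadsPerBlock <= 1024.
constexpr std::uint64_t residentBlocksPerSM(std::uint64_t threadsPerBlock) {
    return minU64(kMaxResidentBlocksPerSM, kMaxResidentThreadsPerSM / threadsPerBlock);
}

// Blocks that can be resident on the whole device at the same time.
constexpr std::uint64_t residentBlocksOnDevice(std::uint64_t threadsPerBlock) {
    return kNumMultiprocessors * residentBlocksPerSM(threadsPerBlock);
}

// Rough number of "rounds" of resident blocks needed to work through a grid.
// Real hardware refills each multiprocessor as individual blocks finish.
constexpr std::uint64_t wavesNeeded(std::uint64_t gridBlocks, std::uint64_t threadsPerBlock) {
    return (gridBlocks + residentBlocksOnDevice(threadsPerBlock) - 1) /
           residentBlocksOnDevice(threadsPerBlock);
}
```

Under this model, `residentBlocksOnDevice(128)` gives the 112 blocks mentioned above, while blocks of 1024 threads drop to one resident block per multiprocessor, so `residentBlocksOnDevice(1024)` is 14. A full 1D grid of 65535 blocks of 128 threads is still a legal launch; the blocks are simply executed over time as resources free up, roughly `wavesNeeded(65535, 128)` = 586 rounds.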

