How to determine the max number of blocks and threads for a GPU?

Is there a way to determine the max number of grids, blocks per grid, and threads per block, on a given GPU?

For instance, the Tesla C2075 board, as specified here:
https://www.nvidia.com/docs/IO/43395/NV-DS-Tesla-C2075.pdf
has 448 cores.

What does this mean for the maximum number of blocks n and threads m that I can use when launching the kernel with

my_function<<<n,m>>>()

You got a few answers for the same question yesterday:
https://devtalk.nvidia.com/default/topic/1045204/cuda-programming-and-performance/where-do-i-check-the-number-of-blocks-and-threads-that-are-available-on-tesla-c2050-/

Thanks
I found the information in table 14

But I still don’t understand how many blocks I can create.

My GPU is Tesla C2050/2075 and when I run the command deviceQuery I see:

Device 0: “Tesla C2075”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5301 MBytes (5558501376 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Max Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla C2075
Result = PASS

The table 14 that you mentioned specifies that for CUDA capability 2.0 there are

Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8

Does this mean that the largest number of blocks and threads that can run my function in parallel is

my_function<<<8,1024>>>()
 ?

No, it does not. A GPU kernel launch can consist of many more blocks than can be resident on a multiprocessor at any one time.

The most immediate limits are these:

Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
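These limits (the same numbers deviceQuery prints) can also be read programmatically through the CUDA runtime API; a minimal sketch, compiled with nvcc:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims (x,y,z): (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims (x,y,z):  (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Multiprocessors:        %d\n", prop.multiProcessorCount);
    return 0;
}
```

On a Tesla C2075 this would report the same values as the deviceQuery output above.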

The kernel launch configuration <<<…>>> specifies two dim3 quantities: the first is the number of blocks in the grid, and the second is the number of threads in each block.

These are 3-dimensional quantities. Each dimension must satisfy the respective limit. Furthermore, the total number of threads in the block (i.e. the product of the 3 dimensions of the thread block size) must be at most 1024.

Anything within those limits is possible.
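For example, any of the following launch configurations would be legal on this device (a sketch; my_kernel stands in for your own kernel):

```cuda
dim3 grid(256, 128);    // 2D grid: 256 x 128 = 32768 blocks, each dim <= 65535
dim3 block(32, 8, 4);   // 3D block: 32 * 8 * 4 = 1024 threads, <= 1024 total
my_kernel<<<grid, block>>>();   // legal

my_kernel<<<65535, 1024>>>();   // also legal: 1D grid at the per-dimension limit

// my_kernel<<<1, 2048>>>();    // ILLEGAL: 2048 > 1024 threads per block
```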

No. The number of blocks in a kernel launch is limited by the maximum dimensions of the grid. The first parameter between the triple angular brackets is the grid configuration (1D, 2D, or 3D): blocks in the grid. The second parameter is the block configuration (1D, 2D, or 3D): threads in each block.

So in this case, you can have a 1D grid of up to 65535 blocks, a 2D grid of up to 65535 x 65535 blocks, or a 3D grid of up to 65535 x 65535 x 65535 blocks.

In typical CUDA programs the number of blocks in a grid is significantly larger than the number of blocks that can execute simultaneously at any given time, which is max_resident_blocks_per_multiprocessor x number_of_multiprocessors (in this case: 8 x 14 = 112 blocks).

[I see I should have F5-ed, as Robert Crovella already supplied an answer …]