Maximum number of threads on thread block

aLbErT_h · November 25, 2016, 3:58pm

Hi,

I started with CUDA 2 days ago. I installed the drivers for my Tesla K20m and the CUDA Toolkit. I tested the different examples and they all work fine, but now I have a question. I know this is a stupid question, but I’m just starting with parallel computing at my job and I’m not sure.

If I run ‘deviceQuery’ I get the following results:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4742 MBytes (4972412928 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS

My question is about ‘Maximum number of threads per block’ and ‘Max dimension size of a thread block (x,y,z)’. I understand that 1024 is the maximum number of threads that I can have in one block, but the max dimension size of a thread block is not clear to me. Does it mean that I can have x*y*z threads in a block? Or does it mean that the maximum number of threads in each dimension is x, y, z (in this case 1024, 1024, 64)?

In my case, if I want a thread block with X=1 and Y=1, does the maximum number of threads for Z have to be 64?

Thank you and sorry for the question.

Thanks and best regards.

Robert_Crovella · November 25, 2016, 4:09pm

There are multiple limits. All must be satisfied.

1. The maximum number of threads in the block is limited to 1024. This is the product of your threadblock dimensions (x*y*z). For example, (32,32,1) creates a block of 1024 threads. (33,32,1) is not legal, since 33*32*1 > 1024.
2. The maximum x-dimension is 1024. (1024,1,1) is legal. (1025,1,1) is not legal.
3. The maximum y-dimension is 1024. (1,1024,1) is legal. (1,1025,1) is not legal.
4. The maximum z-dimension is 64. (1,1,64) is legal. (2,2,64) is also legal. (1,1,65) is not legal.

Also, threadblock dimensions of 0 in any position are not legal.

Your choice of threadblock dimensions (x,y,z) must satisfy each of the rules 1-4 above.
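The rules above can be sketched as a small host-side C++ check. This is only an illustration: the constant and function names are hypothetical (they are not part of the CUDA API), and the limits are hard-coded from the deviceQuery output in this thread rather than queried from the device.

```cpp
// Host-side sketch of the block-shape rules listed above.
// All names here are hypothetical; the limits are copied from the
// deviceQuery output for the Tesla K20m, not read from the device.
constexpr unsigned kMaxThreadsPerBlock = 1024;
constexpr unsigned kMaxBlockDimX = 1024;
constexpr unsigned kMaxBlockDimY = 1024;
constexpr unsigned kMaxBlockDimZ = 64;

// Returns true only if every limit is satisfied at the same time.
constexpr bool is_valid_block(unsigned x, unsigned y, unsigned z) {
    return x >= 1 && y >= 1 && z >= 1              // a dimension of 0 is not legal
        && x <= kMaxBlockDimX                      // per-dimension limits
        && y <= kMaxBlockDimY
        && z <= kMaxBlockDimZ
        && static_cast<unsigned long long>(x) * y * z
               <= kMaxThreadsPerBlock;             // limit on the product
}
```

With these limits, is_valid_block(32, 32, 1) and is_valid_block(2, 2, 64) hold, while is_valid_block(33, 32, 1) and is_valid_block(1, 1, 65) do not.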

You should also do proper cuda error checking. Not sure what that is? Google “proper cuda error checking” and take the first hit.

Also run your codes with cuda-memcheck.

Do these steps before asking others for help. Even if you don’t understand the error output, it will be useful to others trying to help you.

aLbErT_h · November 25, 2016, 4:28pm

Hi txbob,

Thanks for your answer, now I understand how it works.

I will keep in mind to do my experiments with cuda-memcheck.

saulocpp · June 12, 2018, 9:41am

txbob, the numbers you provided above, are they particular to a specific family? By the date, I’m assuming these are Pascal limits?

If that is the case, then properly tuning the kernel parameters, so to speak, would require querying the CC of the device at runtime and then calculating the maximum number of blocks and threads per block for that device?

In the case where my data is so large that a single sample can’t be individually assigned to a specific thread, then we have to rearrange the kernel parameters so that each thread will process more than 1 piece of data?

I can take as an example the NVIDIA video “First CUDA program” and the infamous and ever-repeated vector add, in which each of the 1024 elements can be assigned to its own thread. The narrator says that if you have more threads than data to be processed, it won’t be a problem, but then we have to guess what to do when the opposite happens, which is very often the case in the real world.

Robert_Crovella · October 9, 2018, 8:46pm

They apply to compute capability 2.0 and higher:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

A grid, as we have seen, can be quite large: exactly 9,444,444,733,164,249,676,800 threads at most (2147483647 * 65535 * 65535 blocks of 1024 threads each, roughly 9.4 sextillion). No current CUDA GPU has the memory space or address space to support that number of elements in a dataset; that maximum exceeds 2^64 by a factor of roughly 512. However, to decouple the size of the grid from the size of your dataset, a canonical method is the grid-stride loop.
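To make the indexing of a grid-stride loop concrete, here is a CPU-only emulation in plain C++. It is a sketch, not CUDA code: the function names are hypothetical, `global_id` stands in for `blockIdx.x * blockDim.x + threadIdx.x`, and `total_threads` stands in for `gridDim.x * blockDim.x`. The emulated threads run one after another instead of in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What a single emulated thread does: start at its own global index and
// advance by the total number of threads until the data runs out.
inline void grid_stride_add(std::size_t global_id, std::size_t total_threads,
                            const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::vector<float>& out) {
    for (std::size_t i = global_id; i < out.size(); i += total_threads) {
        out[i] = a[i] + b[i];
    }
}

// Runs every emulated thread in turn (serially, for illustration only).
inline std::vector<float> vector_add_emulated(const std::vector<float>& a,
                                              const std::vector<float>& b,
                                              std::size_t blocks,
                                              std::size_t threads_per_block) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size(), 0.0f);
    const std::size_t total_threads = blocks * threads_per_block;
    for (std::size_t id = 0; id < total_threads; ++id) {
        grid_stride_add(id, total_threads, a, b, out);
    }
    return out;
}
```

With 2 blocks of 2 threads and 10 elements, emulated thread 0 handles indices 0, 4 and 8, thread 1 handles 1, 5 and 9, and so on, so every element is covered even though there are fewer threads than elements. The same loop is also correct when there are more threads than elements, because the extra threads fail the loop condition immediately.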

asandip785 · December 3, 2021, 10:09pm

Along the same lines, I queried the device properties. Is the maximum number of blocks per grid in the x, y and z:
Max grid size, dim(0): 2147483647
Max grid size, dim(1): 65535
Max grid size, dim(2): 65535
?

Does this mean that dim[0] could have a maximum of 2147483647 blocks, with 1024 threads per block?

Robert_Crovella · December 3, 2021, 10:18pm

Yes. To both questions.

bansalgarish · March 21, 2023, 10:15am

I also queried my graphics card and got the same answer. Does this mean I can have 1 billion in dim[0], 100 in dim[1], and number of threads = 1000?

Robert_Crovella · March 21, 2023, 2:08pm

yes
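The configuration in the question can be checked against the reported grid limits with a few lines of host-side C++. This is a sketch with hypothetical names; the limits are hard-coded from the values quoted in this thread, and the z grid dimension is assumed to be 1 because the question does not mention it.

```cpp
// Grid limits as reported in this thread (hypothetical constant names).
constexpr unsigned long long kMaxGridDimX = 2147483647ULL;
constexpr unsigned long long kMaxGridDimY = 65535ULL;
constexpr unsigned long long kMaxGridDimZ = 65535ULL;

// True if each grid dimension is nonzero and within its own limit.
constexpr bool is_valid_grid(unsigned long long x, unsigned long long y,
                             unsigned long long z) {
    return x >= 1 && y >= 1 && z >= 1
        && x <= kMaxGridDimX && y <= kMaxGridDimY && z <= kMaxGridDimZ;
}

// Total threads launched. The caller must make sure the product fits in
// 64 bits; the largest possible grid with 1024-thread blocks does not.
constexpr unsigned long long total_threads(unsigned long long x,
                                           unsigned long long y,
                                           unsigned long long z,
                                           unsigned long long threads_per_block) {
    return x * y * z * threads_per_block;
}
```

For 1 billion blocks in x, 100 in y, 1 in z and 1000 threads per block, is_valid_grid returns true and total_threads gives 100,000,000,000,000 threads.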

spring_wind · September 21, 2023, 1:09am

Can the nvcc compiler give me an error at build time, rather than at run time, if I do not obey one of the specifications for my CC?

Robert_Crovella · September 21, 2023, 1:15am

no, nvcc doesn’t do that. In the general case, it cannot do that, because these are all potentially runtime-determined quantities. They are not necessarily compile-time constants, therefore in the general case the compiler cannot determine what dimensions you are attempting to use.

If they happen to be compile-time constants, then in theory nvcc could possibly do that sort of checking. It does not, which you can confirm yourself with a trivial test.
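When the dimensions really are compile-time constants, one way to get a build-time check yourself is a `static_assert` in your own host code. The sketch below assumes a hypothetical `BlockShape` helper (not part of CUDA or nvcc) with the CC 2.0+ block limits from this thread hard-coded; it is plain C++ and says nothing about what nvcc does on its own.

```cpp
// Hypothetical helper, not part of CUDA: the block dimensions are template
// parameters, so violating a limit becomes a compile error as soon as the
// template is instantiated.
template <unsigned X, unsigned Y, unsigned Z>
struct BlockShape {
    static_assert(X >= 1 && Y >= 1 && Z >= 1,
                  "block dimensions must be nonzero");
    static_assert(X <= 1024 && Y <= 1024,
                  "x and y dimensions are limited to 1024");
    static_assert(Z <= 64, "z dimension is limited to 64");
    static_assert(static_cast<unsigned long long>(X) * Y * Z <= 1024ULL,
                  "a block may hold at most 1024 threads");

    static constexpr unsigned x = X;
    static constexpr unsigned y = Y;
    static constexpr unsigned z = Z;
    static constexpr unsigned threads = X * Y * Z;
};

using Tile = BlockShape<32, 32, 1>;
static_assert(Tile::threads == 1024, "32 * 32 * 1 fills the block exactly");
// BlockShape<33, 32, 1>::threads would not compile: 33 * 32 * 1 > 1024.
```

This only covers constants known at build time. Dimensions computed at run time still need a run-time check or CUDA error checking after the launch.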

spring_wind · September 21, 2023, 2:02am

I see. If I launch a kernel using “<<<>>>”, should I check for errors right after that line, or should I call a CUDA sync-like function and then check for errors?

Robert_Crovella · September 21, 2023, 12:53pm

potentially both, although launch configuration errors will be caught by a call to cudaGetLastError() immediately after a kernel launch line. For a good treatment of error-checking, I usually refer to this. Additional discussion can be found in unit 12 here, explaining the different types of errors that may occur due to a kernel launch, and clarifying why we often recommend 2 types of error checking after a kernel launch.
