Maximum number of threads on thread block


I started with CUDA 2 days ago. I installed the drivers for my Tesla K20m and the CUDA Toolkit. I tested the different examples and they all work fine, but now I have a question… I know this is a stupid question, but I'm just starting with parallel computing in my job and I'm not sure.

If I execute 'deviceQuery' I obtain the following results:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4742 MBytes (4972412928 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS

My question is about 'Maximum number of threads per block' and 'Max dimension size of a thread block (x,y,z)'. I understand that 1024 is the maximum number of threads that I can have in one block, but the max dimension size of a thread block is not clear to me… does it mean that I can have x*y*z threads in a block? Or does it mean that the maximum number of threads in each dimension is x, y, z (in this case 1024, 1024, 64)?

In my case, if I want a thread block with X=1 and Y=1, does that mean the maximum number of threads for Z is 64?

Thank you and sorry for the question.

Thanks and best regards.


There are multiple limits. All must be satisfied.

  1. The maximum number of threads in the block is limited to 1024. This is the product of whatever your threadblock dimensions are (xyz). For example (32,32,1) creates a block of 1024 threads. (33,32,1) is not legal, since 33*32*1 > 1024.

  2. The maximum x-dimension is 1024. (1024,1,1) is legal. (1025,1,1) is not legal.

  3. The maximum y-dimension is 1024. (1,1024,1) is legal. (1,1025,1) is not legal.

  4. The maximum z-dimension is 64. (1,1,64) is legal. (2,2,64) is also legal. (1,1,65) is not legal.

Also, threadblock dimensions of 0 in any position are not legal.

Your choice of threadblock dimensions (x,y,z) must satisfy each of the rules 1-4 above.
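The four rules can be condensed into a small host-side sanity check. This is a plain C sketch with the limits hard-coded for this cc 3.5 device; `block_dims_legal` and the macro names are just illustrative:

```c
#include <stdbool.h>

/* Limits reported by deviceQuery for this cc 3.5 device (Tesla K20m). */
#define MAX_THREADS_PER_BLOCK 1024
#define MAX_BLOCK_X 1024
#define MAX_BLOCK_Y 1024
#define MAX_BLOCK_Z 64

/* Returns true if (x,y,z) is a legal thread block shape on this device. */
static bool block_dims_legal(unsigned x, unsigned y, unsigned z)
{
    if (x == 0 || y == 0 || z == 0) return false;  /* zero dims are not legal */
    if (x > MAX_BLOCK_X)            return false;  /* rule 2 */
    if (y > MAX_BLOCK_Y)            return false;  /* rule 3 */
    if (z > MAX_BLOCK_Z)            return false;  /* rule 4 */
    if ((unsigned long long)x * y * z > MAX_THREADS_PER_BLOCK)
        return false;                              /* rule 1: product limit */
    return true;
}
```

For example, `block_dims_legal(32, 32, 1)` passes, while `block_dims_legal(33, 32, 1)` fails rule 1 even though each individual dimension is within its own limit.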

You should also do proper cuda error checking. Not sure what that is? Google “proper cuda error checking” and take the first hit.
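A minimal sketch of the usual pattern (the macro name `cudaCheck` here is just illustrative, not the canonical one from that search result):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line info if a CUDA runtime call fails.
#define cudaCheck(call)                                              \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// An illegal block shape is reported at launch time, e.g.:
//   kernel<<<1, dim3(33, 32, 1)>>>(...);   // 33*32 > 1024
//   cudaCheck(cudaGetLastError());         // invalid configuration argument
```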

Also run your codes with cuda-memcheck.

Do these steps before asking others for help. Even if you don’t understand the error output, it will be useful to others trying to help you.


Hi txbob,

Thanks for your answer, now I understand how it works.

I will keep in mind to do my experiments with cuda-memcheck.

txbob, are the numbers you provided above particular to a specific GPU family? Judging by the date, I'm assuming these are Pascal limits?

If that is the case, then proper tuning of the kernel launch parameters, so to speak, would require querying the compute capability of the device at runtime and then calculating the maximum number of blocks and threads per block for that device?
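By querying at runtime I mean something like this (my sketch based on the toolkit's `cudaGetDeviceProperties`; untested):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    // Read the per-device limits instead of hard-coding them.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim      = (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
```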

And in the case where my data is so large that each sample can't be assigned to its own thread, do we have to rearrange the kernel parameters so that each thread processes more than one piece of data?

Take as an example the NVIDIA "First CUDA program" video and the infamous, ever-repeated vector add, in which all 1024 elements can be assigned to their own thread. The narrator says that having more threads than data to process is not a problem, but we are left to guess what to do when the opposite happens, which is very often the case in the real world.
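Something like this grid-stride pattern is what I have in mind (just my own sketch, untested):

```cuda
// Vector add where n may be much larger than gridDim.x * blockDim.x:
// each thread strides through the array, handling multiple elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}
```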

They apply to compute capability 2.0 and higher: