Maximum number of threads on thread block


I started with CUDA two days ago. I installed the drivers for my Tesla K20m and the CUDA Toolkit. I tested the different examples and they all work fine, but now I have a question… I know this may be a basic question, but I am just starting with parallel computing in my job and I'm not sure.

If I execute ‘deviceQuery’ I obtain the following results:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4742 MBytes (4972412928 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Max Clock rate:                            706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K20m
Result = PASS

My question is about the ‘Maximum number of threads per block’ and the ‘Max dimension size of a thread block (x,y,z)’. I understand that 1024 is the maximum number of threads that I can have in one block, but the max dimension size of a thread block is not clear to me… does it mean that I can have x*y*z threads per block? Or does it mean that the maximum number of threads in each dimension is x, y, z (in this case 1024, 1024, 64)?

In my case, if I want a thread block with X=1 and Y=1, is the maximum number of threads for Z then 64?

Thank you and sorry for the question.

Thanks and best regards.


There are multiple limits. All must be satisfied.

  1. The maximum number of threads in the block is limited to 1024. This is the product of your threadblock dimensions (x*y*z). For example, (32,32,1) creates a block of 1024 threads. (33,32,1) is not legal, since 33*32*1 > 1024.

  2. The maximum x-dimension is 1024. (1024,1,1) is legal. (1025,1,1) is not legal.

  3. The maximum y-dimension is 1024. (1,1024,1) is legal. (1,1025,1) is not legal.

  4. The maximum z-dimension is 64. (1,1,64) is legal. (2,2,64) is also legal. (1,1,65) is not legal.

Also, threadblock dimensions of 0 in any position are not legal.

Your choice of threadblock dimensions (x,y,z) must satisfy each of the rules 1-4 above.

You should also do proper cuda error checking. Not sure what that is? Google “proper cuda error checking” and take the first hit.

Also run your codes with cuda-memcheck.

Do these steps before asking others for help. Even if you don’t understand the error output, it will be useful to others trying to help you.


Hi txbob,

Thanks for your answer, now I understand how it works.

I will keep in mind to do my experiments with cuda-memcheck.

txbob, are the numbers you provided above particular to a specific GPU family? Judging by the date, I'm assuming these are Pascal limits?

If that is the case, then proper tuning of the kernel launch parameters, so to speak, would require querying the device's compute capability at runtime and then calculating the maximum number of blocks and threads per block for that device?

In the case where my data is so large that each sample can't be assigned to its own thread, do we then have to rearrange the kernel parameters so that each thread processes more than one piece of data?

I can take as an example the NVIDIA “First CUDA Program” video and the infamous, ever-repeated vector add, in which all 1024 elements can be assigned to their own thread. The narrator says that if you have more threads than data to be processed it won't be a problem, but we are left to guess what happens in the opposite case, which is very often the situation in the real world.

The limits above apply to compute capability 2.0 and higher.

A grid, as we have seen, can be quite large (more than quintillions of threads: 9,444,444,733,164,249,676,800 threads maximum, exactly; no current CUDA GPU has the memory or address space to support that many elements in a dataset, and that maximum exceeds 2^64 by several orders of magnitude). However, to decouple the size of the grid from the size of your dataset, the canonical method is the grid-stride loop.

Along the same lines, I queried the device properties. Is this the maximum number of blocks per grid in x, y and z?
Max grid size, dim(0): 2147483647
Max grid size, dim(1): 65535
Max grid size, dim(2): 65535

Does this mean that in dim[0] I could have a maximum of 2147483647 blocks, with 1024 threads per block?

Yes. To both questions.
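And rather than hard-coding these limits, they can be queried at runtime, which also answers the earlier tuning question. A minimal sketch using the CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the limits of device 0 at runtime instead of hard-coding them.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Compute capability:     %d.%d\n", prop.major, prop.minor);
    std::printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    std::printf("Max block dims (x,y,z): %d, %d, %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("Max grid dims (x,y,z):  %d, %d, %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```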


I also queried my graphics card and got the same answer. Does this mean I can have 1 billion blocks in dim[0], 100 blocks in dim[1], and 1000 threads per block?


Can the nvcc compiler give me this error if I do not obey one of the compute capability limits when building (not running) the program?

No, nvcc doesn’t do that. In the general case it cannot, because these are all potentially runtime-determined quantities. They are not necessarily compile-time constants, so in the general case the compiler cannot determine what dimensions you are attempting to use.

If they happen to be compile-time constants, then in theory nvcc could possibly do that sort of checking. It does not, which you can confirm yourself with a trivial test.

I see. If I launch a kernel using “<<<>>>”, should I check the error right after that line, or call a synchronizing function such as cudaDeviceSynchronize() and then check the error?

Potentially both, although launch configuration errors will be caught by a call to cudaGetLastError() immediately after the kernel launch line. For a good treatment of error checking, I usually refer to this. Additional discussion can be found in unit 12 here, explaining the different types of errors that may occur due to a kernel launch and clarifying why we often recommend two types of error checking after a kernel launch.
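The two-stage pattern described above can be sketched as follows (a minimal illustration, not any particular library's macro; `myKernel` is a placeholder, and the deliberately illegal block size of 2048 triggers the launch-configuration path):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { /* ... */ }

int main() {
    float* d_data = nullptr;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    myKernel<<<1, 2048>>>(d_data);   // 2048 > 1024: illegal block size

    // 1. Catch launch-configuration errors immediately (synchronous errors).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));

    // 2. Catch errors that occur while the kernel runs (asynchronous errors).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```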