Max threads/block


I don’t understand exactly what is the max number of threads per block. If I do something like:

int maxThreadsPerBlock;



I get maxThreadsPerBlock=512. But tried a matrix multiplication with much more threads/block and the result seems to be correct. So what is the exact limit?


  1. My guess is that you are not using the correct kernel launch syntax for the runtime API. The correct syntax is
Kernel <<< number of blocks, number of threads per block >>>(arguments)

So if you are passing a number larger than 512 to the first launch parameter, you are not running more than 512 threads per block. If you pass a big number as the second parameter, the should be a kernel launch failure.

That doesn’t seem to be the problem. Here is my kernel launcher:


    threadsPerBlock>>>(d_A, d_B, d_C, L, M, N);

I don’t understand why I don’t get any error. Isn’t like CUDA is handling internally this issue? Maybe splitting the block?

CUDA never splits blocks, so if that code runs without producing an error, then something very basic is wrong:

  • The driver is reporting the wrong number for maxThreadsPerBlock. Check the CUDA programming guide to see what it should be for your device.

  • You are not actually trapping CUDA errors. How are you checking for errors in this kernel launch?

I am actually not trapping any error…

That then begs the question, how do you know it is working?

I am comparing the result with a cpu matrix multiplication. About that I have another question: when I init 1k x 1k matrices with random numbers between 0-1e4 or more I get wrong results in CUDA. What is exactly the numeric limit of a float in CUDA and why is it different from the cpu one?

Thanks again

Depending on which type of GPU you use, single precision is either fully IEEE-754 2008 compliant, or “almost”. Unless your CPU reference result is being done in double precision or better, you really can’t judge which results might be “wrong”. More often than not, the structure of algorithms like dot products in CUDA actually produces better results than a pure serial CPU result done in the same precision, because of the use of multiply-add instructions which don’t suffer from intermediate rounding, algorithms which employ increased numbes of partial sums used to compute the result, and ordering phenomena that tends to make adjacent summation terms closer in magnitude, which reduces precision losses.

But there is still the question of how you are running code with more than 512 threads per block when the API reports 512 is the limit. Can you post a concise repro case demonstrating this amazing feat?

Here it is:

// Find max number of threads per blocks allowed by the GPU.

  int maxThreadsPerBlock;



// Setup grid and blocks

  dim3 threadsPerBlock = dim3(atoi(argv[1]), atoi(argv[2]));

  dim3 blocksPerGrid = dim3((N + threadsPerBlock.x - 1)/threadsPerBlock.x,

      (L + threadsPerBlock.y - 1)/threadsPerBlock.y);

cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice);

  cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice);

std::cout << "Max threads/block: " << maxThreadsPerBlock << std::endl;

  std::cout << "Threads/block: " << "(" << threadsPerBlock.x << "," << 

    threadsPerBlock.y << ")" << std::endl;

// Perform matrix multiplication on GPU.


    threadsPerBlock>>>(d_A, d_B, d_C, L, M, N);

  cudaMemcpy(h_Cgpu, d_C, sizeC, cudaMemcpyDeviceToHost);

// Perform matrix multiplication on CPU.

  matrixMultiplicationOnCpu(h_A, h_B, h_Ccpu, L, M, N);

I have enclosed the full source, it takes two args: ./mmm threadsPerBlockX threadsPerBlockY (3.1 KB)

That code indeed doesn’t run when threadsPerBlock exceeds the limit the API reports. It will, however, happily return results from a previous, successful run from device memory, should such results exist, and that will make your checking code report a pass when nothing actually happened. This is mostly because you are resetting the random number seed in your array creation code which guarantees that every matrix of a given size will be identical. You can confirm this for yourself by (1) seeding the generator to something that changes, like something from the system clock, (2) adding a call to cudaGetLastError() after the kernel launch and checking for a non-zero result, which will indicate a launch failure, and (3) calling cudaMemset on d_C before you launch the kernel to clear any old results out of memory.

You might see something very different then.

I see. Thanks a lot for your help!