Max threads/block

memecs · March 7, 2011, 12:34pm

Hello,

I don’t understand exactly what is the max number of threads per block. If I do something like:

int maxThreadsPerBlock;

  cuDeviceGetAttribute(&maxThreadsPerBlock, 

      CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, cuDevice);

I get maxThreadsPerBlock=512. But tried a matrix multiplication with much more threads/block and the result seems to be correct. So what is the exact limit?

Thanks

avidday · March 7, 2011, 12:50pm

My guess is that you are not using the correct kernel launch syntax for the runtime API. The correct syntax is

Kernel <<< number of blocks, number of threads per block >>>(arguments)

So if you are passing a number larger than 512 to the first launch parameter, you are not running more than 512 threads per block. If you pass a big number as the second parameter, the should be a kernel launch failure.

memecs · March 7, 2011, 1:22pm

My guess is that you are not using the correct kernel launch syntax for the runtime API. The correct syntax is
Kernel <<< number of blocks, number of threads per block >>>(arguments)
So if you are passing a number larger than 512 to the first launch parameter, you are not running more than 512 threads per block. If you pass a big number as the second parameter, the should be a kernel launch failure.

That doesn’t seem to be the problem. Here is my kernel launcher:

matrixMultiplicationOnGpu<<<blocksPerGrid, 

    threadsPerBlock>>>(d_A, d_B, d_C, L, M, N);

I don’t understand why I don’t get any error. Isn’t like CUDA is handling internally this issue? Maybe splitting the block?

seibert · March 7, 2011, 2:09pm

CUDA never splits blocks, so if that code runs without producing an error, then something very basic is wrong:

The driver is reporting the wrong number for maxThreadsPerBlock. Check the CUDA programming guide to see what it should be for your device.
You are not actually trapping CUDA errors. How are you checking for errors in this kernel launch?

memecs · March 7, 2011, 3:00pm

I am actually not trapping any error…

avidday · March 7, 2011, 3:06pm

That then begs the question, how do you know it is working?

memecs · March 7, 2011, 3:27pm

I am comparing the result with a cpu matrix multiplication. About that I have another question: when I init 1k x 1k matrices with random numbers between 0-1e4 or more I get wrong results in CUDA. What is exactly the numeric limit of a float in CUDA and why is it different from the cpu one?

Thanks again

avidday · March 7, 2011, 3:51pm

Depending on which type of GPU you use, single precision is either fully IEEE-754 2008 compliant, or “almost”. Unless your CPU reference result is being done in double precision or better, you really can’t judge which results might be “wrong”. More often than not, the structure of algorithms like dot products in CUDA actually produces better results than a pure serial CPU result done in the same precision, because of the use of multiply-add instructions which don’t suffer from intermediate rounding, algorithms which employ increased numbes of partial sums used to compute the result, and ordering phenomena that tends to make adjacent summation terms closer in magnitude, which reduces precision losses.

But there is still the question of how you are running code with more than 512 threads per block when the API reports 512 is the limit. Can you post a concise repro case demonstrating this amazing feat?

memecs · March 7, 2011, 4:02pm

Depending on which type of GPU you use, single precision is either fully IEEE-754 2008 compliant, or “almost”. Unless your CPU reference result is being done in double precision or better, you really can’t judge which results might be “wrong”. More often than not, the structure of algorithms like dot products in CUDA actually produces better results than a pure serial CPU result done in the same precision, because of the use of multiply-add instructions which don’t suffer from intermediate rounding, algorithms which employ increased numbes of partial sums used to compute the result, and ordering phenomena that tends to make adjacent summation terms closer in magnitude, which reduces precision losses.

But there is still the question of how you are running code with more than 512 threads per block when the API reports 512 is the limit. Can you post a concise repro case demonstrating this amazing feat?

Here it is:

// Find max number of threads per blocks allowed by the GPU.

  int maxThreadsPerBlock;

  cuDeviceGetAttribute(&maxThreadsPerBlock, 

      CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, cuDevice);

// Setup grid and blocks

  dim3 threadsPerBlock = dim3(atoi(argv[1]), atoi(argv[2]));

  dim3 blocksPerGrid = dim3((N + threadsPerBlock.x - 1)/threadsPerBlock.x,

      (L + threadsPerBlock.y - 1)/threadsPerBlock.y);

cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice);

  cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice);

std::cout << "Max threads/block: " << maxThreadsPerBlock << std::endl;

  std::cout << "Threads/block: " << "(" << threadsPerBlock.x << "," << 

    threadsPerBlock.y << ")" << std::endl;

// Perform matrix multiplication on GPU.

  matrixMultiplicationOnGpu<<<blocksPerGrid, 

    threadsPerBlock>>>(d_A, d_B, d_C, L, M, N);

  cudaMemcpy(h_Cgpu, d_C, sizeC, cudaMemcpyDeviceToHost);

// Perform matrix multiplication on CPU.

  matrixMultiplicationOnCpu(h_A, h_B, h_Ccpu, L, M, N);

I have enclosed the full source, it takes two args: ./mmm threadsPerBlockX threadsPerBlockY
mmm.cu (3.1 KB)

avidday · March 7, 2011, 6:26pm

That code indeed doesn’t run when threadsPerBlock exceeds the limit the API reports. It will, however, happily return results from a previous, successful run from device memory, should such results exist, and that will make your checking code report a pass when nothing actually happened. This is mostly because you are resetting the random number seed in your array creation code which guarantees that every matrix of a given size will be identical. You can confirm this for yourself by (1) seeding the generator to something that changes, like something from the system clock, (2) adding a call to cudaGetLastError() after the kernel launch and checking for a non-zero result, which will indicate a launch failure, and (3) calling cudaMemset on d_C before you launch the kernel to clear any old results out of memory.

You might see something very different then.

memecs · March 7, 2011, 8:27pm

That code indeed doesn’t run when threadsPerBlock exceeds the limit the API reports. It will, however, happily return results from a previous, successful run from device memory, should such results exist, and that will make your checking code report a pass when nothing actually happened. This is mostly because you are resetting the random number seed in your array creation code which guarantees that every matrix of a given size will be identical. You can confirm this for yourself by (1) seeding the generator to something that changes, like something from the system clock, (2) adding a call to cudaGetLastError() after the kernel launch and checking for a non-zero result, which will indicate a launch failure, and (3) calling cudaMemset on d_C before you launch the kernel to clear any old results out of memory.

You might see something very different then.

I see. Thanks a lot for your help!

Topic		Replies	Views
I wonder maximum number of threads per block really limits the number of threads in each block. CUDA Programming and Performance	5	3978	February 9, 2024
Launching Kernel Fail CUDA Programming and Performance	15	3404	May 28, 2014
Max threads/blocks CUDA Programming and Performance	10	81	September 6, 2024
New findings needed to be verified: Maximum thread block is not 1024 in K20 CUDA Programming and Performance	4	754	November 17, 2014
threads and blocks CUDA Programming and Performance	3	1354	May 7, 2012
Maximum number of threads on thread block CUDA Programming and Performance	12	73898	September 21, 2023
finding the best number of threads per block CUDA Programming and Performance	3	7847	January 29, 2010
"max threads exceeded" error isn't reported CUDA Programming and Performance	1	3624	October 13, 2011
Threads Per Block Issue CUDA Programming and Performance	2	888	September 7, 2010
Run a million threads or blocks on a single kernel function, and still works. It supposed to be 512 at maximum, isn't it? CUDA Programming and Performance	4	1308	January 6, 2017

Max threads/block

Related topics