Thread Scheduling / Maximum threads per block in each dimension vs. maximum threads per SM


I am new to CUDA programming and I am stuck on this.

On my PC with a GeForce GT 220 card, the maximum number of threads per SM is 512 and maxBlockDim is (512, 512, 64).

When I allocate more than 512 threads per block in one dimension, the program crashes.

In another system with a Quadro 600 card, the maximum number of threads per SM is 1536 and maxBlockDim is (1024, 1024, 64).

But when I allocate 1536, 2048, or 4096 threads per block in one dimension, the program executes properly; I hardly ever get an error.

Is there a practical limit on the number of threads per block? And what happens if the number of threads increases?

If you are allocating more threads per block than is allowed (note that on compute capability 2.x the maximum number of threads per block is 1024, even though the maximum number of threads per SM is 1536), then an error code will be returned by one of the next CUDA function calls. How are you checking the return codes of the CUDA functions?
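As an illustration (not from the original posts), here is a minimal sketch of both checking CUDA return codes and querying the actual per-launch limit with cudaGetDeviceProperties; the CUDA_CHECK macro name is my own invention:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Check every CUDA runtime call's return code. Kernel launches don't
// return one, so follow each launch with cudaGetLastError().
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess)                                           \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",             \
                    (int)e, __FILE__, __LINE__, cudaGetErrorString(e)); \
    } while (0)

int main()
{
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));

    // maxThreadsPerBlock is the hard per-launch limit (512 on compute
    // capability 1.x, 1024 on 2.x). The per-SM thread count is an
    // occupancy limit, not a launch limit.
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1],
           prop.maxThreadsDim[2]);
    return 0;
}
```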


Thanks for your reply.

This is my Kernel Function

__global__ void myKernel(int *arr)
{
    arr[threadIdx.x] = threadIdx.x;
}


In main(), I am fetching the error like this

int size = 1025; //1024 is the limit in 2.x


err = cudaGetLastError();

printf("\nError %d = %s", err, cudaGetErrorString(err));

The error is

Error 9: invalid configuration argument

When I print the result, I get 0, 1, 2, …, 1023, garbage.

But I am assigning arr[threadIdx.x] = threadIdx.x

so there must be a clear memory violation.

How can it assign arr[1024] to some value ?

Thanks and Regards,


I’m not sure I understand the problem. Since you got a CUDA error when trying to launch your kernel, the contents of dArr will be undefined. It might be left with whatever values were already in that part of GPU memory when you started. CUDA does not zero out memory when you allocate it.
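To make this concrete, here is a hedged sketch (my own reconstruction, reusing the kernel above; the variable names dArr and hArr are assumptions) that zeroes the allocation first, so stale GPU memory cannot masquerade as kernel output when the launch is rejected:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(int *arr)
{
    arr[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int size = 1025;  // exceeds the 1024 threads-per-block limit on 2.x
    int *dArr = NULL;
    int hArr[size];

    cudaMalloc((void **)&dArr, size * sizeof(int));
    // Zero the allocation: CUDA does not clear memory for you.
    cudaMemset(dArr, 0, size * sizeof(int));

    myKernel<<<1, size>>>(dArr);  // rejected: too many threads per block
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));

    // The copy still succeeds; thanks to the memset, hArr is all zeros,
    // showing the kernel never ran instead of printing leftover garbage.
    cudaMemcpy(hArr, dArr, size * sizeof(int), cudaMemcpyDeviceToHost);
    printf("hArr[0] = %d, hArr[1024] = %d\n", hArr[0], hArr[1024]);

    cudaFree(dArr);
    return 0;
}
```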