Passing too many blocks doesn't throw an exception


When I pass too many blocks, e.g. kernel<<<grid(DIM,DIM),…>>> with DIM=2000, it doesn't throw an exception, but when I retrieve the result using cudaMemcpy, it causes a run-time error. How can I find out dynamically how many blocks I can assign?


The number of blocks depends on the GPU compute capability, but for all GPUs it is bigger than 2000. The error comes from something else. The program crashes at cudaMemcpy because that is the next place errors can be reported. See if you can check the errors as shown on this page:
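In case the linked page is unavailable, here is a minimal error-checking sketch. The macro name CUDA_CHECK and the kernel myKernel are placeholders, not part of the original code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with file/line info on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void myKernel(float *out) {  // placeholder kernel
    out[blockIdx.x] = (float)blockIdx.x;
}

int main() {
    float *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 2000 * sizeof(float)));
    myKernel<<<2000, 1>>>(d_out);
    // Launch-configuration errors (too many threads, bad grid) show up here...
    CUDA_CHECK(cudaGetLastError());
    // ...while errors during kernel execution (e.g. out-of-bounds access)
    // surface only after synchronization, or at the next API call
    // such as cudaMemcpy.
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

This is why the crash appears at cudaMemcpy: without an explicit check, the memcpy is simply the first call after the kernel that can report the pending error.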

Check your compile options -arch=sm_xy (choose the appropriate xy for your device) and check that the number of threads per block and the shared memory per block are within the limits. I suspect an out-of-bounds access; you can check this with the cuda-memcheck tool. Use -Xptxas -v to see the resources used by each kernel.

Thanks pasoleatis, actually the number of blocks that I allocated is DIM*DIM, so 4,000,000 blocks. This is absurdly many blocks. But when I assign DIM=200, DIM^2 = 40,000, it works perfectly. So it is reasonable that the error comes from the kernel part, not from the cudaMemcpy.
I guess I need to find another way to determine dynamically the maximum number of blocks that I can assign.
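You can query the limits at run time instead of hard-coding them. A sketch using the standard cudaDeviceProp fields:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max grid size:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max block size: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Comparing your launch configuration against maxGridSize and maxThreadsPerBlock before the kernel call lets you fail with a clear message instead of a delayed run-time error.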


For all cards you can use 65535*65535 (more than 4 billion blocks). This is far more than 4 million. On Kepler cards the x-dimension alone can go up to about 2 billion. The number of blocks you use is very small. If it works for DIM=200 (DIM^2 = 40,000) but not for DIM=2000 (DIM^2 = 4,000,000), you have a bug in the kernel or the number of threads is wrong.

If you look at the Wikipedia CUDA page there is a table with the maximum number of blocks for each generation of GPU.

Yep. I found out that my lab computer has a compute capability 1.0 graphics card:
Maximum x- or y-dimension of a block: 512
That explains why it works up to DIM=500 for the x and y dimensions and throws a runtime error when it's more than 500. I need to upgrade the graphics card!



I think you misunderstood something. Even on compute capability 1.0 you can submit millions of blocks. What you are referring to is the number of threads per block. Those are different things.

This means that when you launch a kernel:
blocks.x and blocks.y can each be up to 65535
while threads.x <= 512, threads.y <= 512, and threads.z <= 64, with threads.x * threads.y * threads.z <= 512