Problems with maximum grid dimension


I am new to CUDA programming. I ran into some behavior that seemed weird to me when I launched my kernel with different grid dimensions. Could you please explain why the kernel does not work properly under certain conditions? Thank you very much in advance!

System: Win7 64bit
Environment: Visual Studio 2013
GPU: GeForce GTX 970M 6GB (Major.Minor: 5.2)
CUDA toolkit version: 8.0

Below is a concise version of my code:

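__global__ void init(int* d_Ld, int* d_Rd, int Nt)
{
    // Flatten the 2D grid and 2D block indices into one global thread id.
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int id = blockId * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
    // Grid-stride loop: each thread handles elements id, id + stride, ...
    int stride = blockDim.x * blockDim.y * gridDim.x * gridDim.y;
    while (id < Nt)
    {
        d_Ld[id] = id;
        d_Rd[id] = id;
        id += stride;
    }
}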

int Nt = 1024 * 1024 * 316;
int dim1_grid, dim2_grid, dim3_grid;    // to be determined
int dim1_block, dim2_block, dim3_block; // to be determined
dim3 griddim(dim1_grid, dim2_grid, dim3_grid);
dim3 blockdim(dim1_block, dim2_block, dim3_block);

int *d_Ld, *d_Rd;
int* h_probe;

if (cudaMalloc((void**)&d_Ld, sizeof(int) * Nt) != cudaSuccess) cout << "ERROR!" << endl;
if (cudaMalloc((void**)&d_Rd, sizeof(int) * Nt) != cudaSuccess) cout << "ERROR!" << endl;
h_probe = (int*)calloc(Nt, sizeof(int));

init<<<griddim, blockdim>>>(d_Ld, d_Rd, Nt);
printf("GPU kernel error: %s\n", cudaGetErrorString(cudaPeekAtLastError()));

cudaMemcpy(h_probe, d_Ld, sizeof(int) * Nt, cudaMemcpyDeviceToHost);
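
As a side note, checking the error state right after the launch only reports launch-time problems such as a rejected grid size; errors raised while the kernel runs surface later. A minimal sketch of a fuller check follows. The helper names cudaOk and kernelOk are hypothetical and not part of the CUDA runtime; only the cuda* calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message when a CUDA runtime call failed.
// Returns true when the call succeeded.
static bool cudaOk(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: call immediately after a kernel launch.
static bool kernelOk(const char* name)
{
    // Launch-time errors, e.g. "invalid argument" for a bad grid or block size.
    if (!cudaOk(cudaGetLastError(), name)) return false;
    // Errors that occur while the kernel executes, e.g. out-of-bounds accesses.
    return cudaOk(cudaDeviceSynchronize(), name);
}
```

With these helpers, the launch above could be followed by `if (!kernelOk("init")) return 1;`, and the cudaMemcpy call could be wrapped the same way with cudaOk.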

Test results under different conditions:

  1. dim1_grid=65535, dim2_grid=1, dim3_grid=1, dim1_block=1024, dim2_block=1, dim3_block=1;
    GPU kernel error: no error
    This is the correct result.

  2. dim1_grid=65536, dim2_grid=1, dim3_grid=1, dim1_block=1024, dim2_block=1, dim3_block=1;
    GPU kernel error: invalid argument
    This is the wrong result.

I know this is probably related to the maximum dimensions of the grid. However, I checked cudaDeviceProp, and maxGridSize[0] is 2147483647, while maxGridSize[1] and maxGridSize[2] are both 65535. Why does the launch fail with 65536 blocks in the first dimension?
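
For reference, the query described above might look like the following sketch. The function name printGridLimits is hypothetical; it only wraps cudaGetDeviceProperties and prints the limits the device itself reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the compute capability and launch limits
// that the runtime reports for the given device index.
// Returns the CUDA status so the caller can see whether the query worked.
static cudaError_t printGridLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("%s (Major.Minor: %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    return cudaSuccess;
}
```

Calling `printGridLimits(0)` queries the first device.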

Maybe you are not compiling for an architecture that matches your GPU.

CUDA 8.0 compiles for cc2.0 by default, which limits the first grid dimension to 65535 even though your GPU supports a much larger one. Compiling for your GPU's architecture (for example, with -arch=sm_52 on the nvcc command line) should remove that limit. There are literally dozens of postings like this on websites such as
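
One way to see the mismatch from inside the program is to compare the device's compute capability with the architecture the kernel was actually built for. The sketch below does that with cudaFuncGetAttributes; the names reportKernelArch and probeKernel are hypothetical, and probeKernel is only a stand-in so the example is self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel for illustration; pass the real kernel (e.g. init) instead.
__global__ void probeKernel() {}

// Hypothetical helper: prints the device's compute capability next to the
// PTX and binary versions recorded for the given kernel.
template <typename Kernel>
static cudaError_t reportKernelArch(Kernel kernel, int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) return err;

    // Both version fields encode major * 10 + minor, e.g. 20 for cc2.0.
    printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    printf("kernel PTX version: %d, binary version: %d\n",
           attr.ptxVersion, attr.binaryVersion);
    return cudaSuccess;
}
```

Calling `reportKernelArch(init, 0)` in the program from the question would show which architecture the kernel targets. A PTX version below 30 on a cc5.2 device would be consistent with the 65535 limit described above, since the larger first-dimension limit only applies from cc3.0 onward.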

Thank you so much for your explanation! That really helps!