Greetings to all.

I am new to CUDA and trying to get my feet wet. My kernels work, but yesterday, while testing various sizes of input data, I ran into a problem that I have not been able to solve.

The problem arose when I changed my grid from one-dimensional to two-dimensional. I needed to do this because my input data requires more than 65535 blocks, which is more than is allowed in gridDim.x or gridDim.y.

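[codebox]
__global__ void cuda_filter_blue(unsigned char *Arr, int depth, int oneDsize){
    // intended: unique thread number across the whole 2D grid
    int thrdNr = (blockIdx.x * gridDim.x + blockIdx.y) * blockDim.x + threadIdx.x;
    // index of the first byte handled by this thread
    int element = thrdNr * depth;
    // skip threads whose bytes would fall past the end of the array
    if (element + depth - 1 < oneDsize){
        Arr[(element + 2)] = 0;
        Arr[(element + 1)] = 0;
    }
}[/codebox]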

[codebox]int main(int argc, char** argv){
    // code for copying data to Arr on the graphics card omitted for clarity

    int block_size = 3;
    int n_blocks = 12288000;
    // number of grid rows, rounded up
    int gridY = n_blocks/cudaDeviceProperties.maxGridSize[0] + (n_blocks%cudaDeviceProperties.maxGridSize[0] == 0 ? 0:1);
    dim3 dimGrid(cudaDeviceProperties.maxGridSize[0], gridY);
    int depth, size;
    depth = 4;
    size = 147456000;
    cuda_filter_blue <<< dimGrid, block_size>>> (devptr, depth, size);
    cudaThreadSynchronize();
    cudaError_t err= cudaGetLastError();
}[/codebox]
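
To make the sizes concrete, here is a minimal host-only sketch of the grid arithmetic from main. The helper names (gridRows, surplusBlocks, bytesCovered) and the constant kMaxGridDim are made up for illustration, and I am assuming maxGridSize[0] is 65535 on my card. It does not touch the GPU; it only checks the numbers.

```cpp
#include <cassert>

// Assumption: per-dimension grid limit of 65535, the figure quoted above.
// On a real device this value comes from cudaDeviceProp::maxGridSize.
const int kMaxGridDim = 65535;

// Number of grid rows needed so that gridX * rows >= nBlocks
// (ceiling division, the same expression used for gridY in main).
int gridRows(int nBlocks, int gridX) {
    assert(gridX > 0);
    return nBlocks / gridX + (nBlocks % gridX == 0 ? 0 : 1);
}

// Blocks launched beyond nBlocks because the last row is only partly needed.
long long surplusBlocks(int nBlocks, int gridX) {
    long long launched = (long long)gridX * gridRows(nBlocks, gridX);
    return launched - nBlocks;
}

// Total bytes addressed when every thread owns `depth` bytes.
long long bytesCovered(int nBlocks, int blockSize, int depth) {
    return (long long)nBlocks * blockSize * depth;
}
```

With n_blocks = 12288000 and a limit of 65535, gridRows gives 188, so the 65535 x 188 grid launches 12320580 blocks, which is 32580 more than needed. bytesCovered(12288000, 3, 4) is 147456000, which matches size, so the surplus blocks are the ones the bounds check in the kernel has to filter out.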

When I run this program, I get a cudaErrorLaunchFailure and I can’t work out why. I noticed that when I swap the grid dimensions in main, changing *“dim3 dimGrid(cudaDeviceProperties.maxGridSize[0], gridY);”* to *“dim3 dimGrid(gridY, cudaDeviceProperties.maxGridSize[0]);”*, the kernel launches but does not perform the desired computation.

I’d greatly appreciate it if someone could help me find out why this doesn’t work. Thanks.