Hi,

I have a working kernel and am trying to increase the number of threads it is launched with. Since I now exceed the limit on the number of blocks in one dimension, I want to set up a 2D grid. However, it doesn't work. I guess I am forgetting a detail, but I can't figure out what.

I want to launch 192*192*175 threads, i.e. 6,451,200 threads. According to the output of the deviceQuery example I can launch up to 512*65535*65535 threads, so I am way under this limit.

My kernel call looks like this:

```
#define n_bins (192*192*175)
...
int numThreadsPerBlock = 64;
dim3 blocks(320, 320);
pre_sum_calc<<<blocks, numThreadsPerBlock>>>(n_bins, xy_coords_d, z_coords_d, pre_sum_d,
                                             BIN_events_d, y_i_d, y_k_d, x_j_d, x_back_d);
cudaThreadSynchronize();
checkCUDAError("pre_sum_calc invocation");
```
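For reference, this is how I arrived at needing a 2D grid. It is only a minimal host-side sketch in plain C++ (it should also compile as CUDA host code); the names `kMaxGridDim`, `blocksNeeded` and `needs2DGrid` are made up for illustration, and 65535 is the per-dimension grid limit from the deviceQuery output.

```cpp
#include <cassert>

// Per-dimension grid limit reported by deviceQuery (value taken from the post).
const long long kMaxGridDim = 65535;

// Hypothetical helper: blocks needed to cover totalThreads, rounding up.
inline long long blocksNeeded(long long totalThreads, long long threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: true if that many blocks cannot fit in a 1D grid.
inline bool needs2DGrid(long long numBlocks) {
    return numBlocks > kMaxGridDim;
}
```

With 6,451,200 threads and 64 threads per block, `blocksNeeded` gives 100,800 blocks, which is more than 65,535, hence the 2D grid. The 320 x 320 grid provides 102,400 blocks, so it is slightly oversized and the surplus threads have to be masked out by a bounds check in the kernel.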

I calculate the thread ID in the kernel like this:

```
__global__ void pre_sum_calc(int n_bins, xy_struct* xy_coords, z_struct* z_coords,
                             float* pre_sum, int* BIN_events, int* y_i, float* y_k, float* x_j, float* x_back)
{
    int tID = blockIdx.x*blockDim.x + blockIdx.y*gridDim.x*blockDim.x + threadIdx.x; // linear thread ID over the 2D grid
    int calc_type = 0;
    if (tID < n_bins) { // skip surplus threads in the oversized grid
        int BIN = BIN_events[tID];
        get_BIN_data(BIN, calc_type, xy_coords, z_coords, pre_sum, y_i, y_k, x_j, x_back); // call to a device function
    }
}
```
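To double-check the index formula, it can be replayed on the host. The following is a minimal sketch, assuming a 2D grid of 1D blocks as in my launch; `linearThreadId` and `coversEachIdOnce` are hypothetical names, and the CUDA built-in variables are passed in as plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same formula as in the kernel, with the built-in variables passed as plain ints.
inline int linearThreadId(int bx, int by, int tx, int gridDimX, int blockDimX) {
    return bx * blockDimX + by * gridDimX * blockDimX + tx;
}

// Returns true if every ID in [0, gridDimX*gridDimY*blockDimX) is produced exactly once.
inline bool coversEachIdOnce(int gridDimX, int gridDimY, int blockDimX) {
    assert(gridDimX > 0 && gridDimY > 0 && blockDimX > 0);
    const std::size_t total =
        static_cast<std::size_t>(gridDimX) * gridDimY * blockDimX;
    std::vector<unsigned char> hits(total, 0);
    for (int by = 0; by < gridDimY; ++by) {
        for (int bx = 0; bx < gridDimX; ++bx) {
            for (int tx = 0; tx < blockDimX; ++tx) {
                const int id = linearThreadId(bx, by, tx, gridDimX, blockDimX);
                // Reject IDs that are out of range or already seen.
                if (id < 0 || static_cast<std::size_t>(id) >= total || hits[id] != 0)
                    return false;
                hits[id] = 1;
            }
        }
    }
    return true;
}
```

Calling `coversEachIdOnce(320, 320, 64)` returns true, and the largest ID, 6,553,599, still fits in an `int`.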

So, I tried several sizes for each dimension of the grid, but in emulation mode the run always stops at the same thread ID, 1624064. The release build on the device returns an unspecified launch failure.
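In case it helps to locate the failing thread, the linear ID can be mapped back to block and thread coordinates. This is only a sketch that inverts the index formula from my kernel; `ThreadCoords` and `decodeThreadId` are hypothetical names, not part of my code.

```cpp
#include <cassert>

// Hypothetical container for the decoded coordinates.
struct ThreadCoords {
    int blockX;   // corresponds to blockIdx.x
    int blockY;   // corresponds to blockIdx.y
    int threadX;  // corresponds to threadIdx.x
};

// Inverts tID = bx*blockDimX + by*gridDimX*blockDimX + tx for 1D blocks in a 2D grid.
inline ThreadCoords decodeThreadId(int tID, int gridDimX, int blockDimX) {
    assert(tID >= 0 && gridDimX > 0 && blockDimX > 0);
    const int block = tID / blockDimX;  // linear block index
    ThreadCoords c;
    c.blockX = block % gridDimX;
    c.blockY = block / gridDimX;
    c.threadX = tID % blockDimX;
    return c;
}
```

For the 320 x 320 grid with 64 threads per block, `decodeThreadId(1624064, 320, 64)` gives block (96, 79), thread 0. That ID is also below 65,535*64 = 4,194,240, so a 1D grid would already have reached it.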

Does anyone see what I am doing wrong here?

Thanks!