Odd block dimension behaviour

I have written a 3D lattice Boltzmann fluid flow solver, but I am having problems with the grid and block dimensions. This is my first attempt at a 3D problem, so I decided to keep things simple.

The code for setting the grid/block dims is:

dim3 Db = dim3(nx, 1, 1);
dim3 Dg = dim3(ny, nz, 1);

where nx, ny and nz are the domain dimensions in x, y and z respectively. The kernel is then called with something like:

kernel<<<Dg, Db>>>(arguments);

The x, y and z coordinates are then recovered within the kernel with:

int x = threadIdx.x;
int y = blockIdx.x;
int z = blockIdx.y;

Now I am aware that this definitely isn’t the most efficient set-up; a better alternative would be to ensure each block consists of a multiple of 32 threads (see the sketch below). The problem I have is that when nx > 128 (threads per block > 128), the program completes almost instantaneously with no errors or crashes, but the results are not at all what I expect.
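By a multiple of 32 I mean something along these lines (just a sketch; the 8x8x4 tile is an arbitrary choice, and it assumes the domain dimensions divide evenly by the tile sizes):

// Sketch only: 8*8*4 = 256 threads per block (a multiple of 32).
// Assumes nx, ny and nz are divisible by the tile sizes.
// Note: a 3D grid (gridDim.z > 1) needs compute capability >= 2.0;
// on older cards the z blocks would have to be folded into gridDim.y.
dim3 Db = dim3(8, 8, 4);
dim3 Dg = dim3(nx / 8, ny / 8, nz / 4);
kernel<<<Dg, Db>>>(arguments);

// Inside the kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;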

The program actually consists of two kernels, one to initialise the data and one to execute the simulation. Interestingly, the initialisation kernel appears to execute correctly: the results match the initialisation values, as though the simulation kernel never ran. This behaviour disappears completely with nx < 128 (threads per block < 128).

Have any of you encountered this sort of behaviour before?

How do you check for kernel errors? It seems very much like the kernel aborts or doesn’t launch.
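As a minimal sketch, checking immediately after each launch catches launch failures, and a synchronise afterwards catches errors raised while the kernel runs:

kernel<<<Dg, Db>>>(arguments);

cudaError_t err = cudaGetLastError();   // launch errors (bad config, too many resources, ...)
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // errors raised while the kernel executes
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));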

How are x, y and z used?

I’m not currently error checking, though I think that will be my project for this morning.

x, y and z are used to calculate the index from which data is loaded and the target to which data is written.
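Roughly like this (a sketch; f_src and f_dst just stand in for my distribution arrays):

// Flatten (x, y, z) into a 1-D array offset, with x varying fastest.
int index = x + nx * (y + ny * z);

float value = f_src[index];   // load source (array names are placeholders)
f_dst[index] = value;         // write target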

Edit: thanks for the heads-up on error checking, you are absolutely correct. The kernel launch fails with the error "too many resources requested for launch", so I guess I have to cut down on my register usage and use more efficient grid and block dimensions.
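For anyone who hits the same thing: compiling with -Xptxas -v prints the per-thread register count, and __launch_bounds__ lets you cap it for a given block size. A sketch (the 256 is just an example cap, and the kernel name and signature are illustrative; forcing the count low can cause spills to local memory):

// nvcc -Xptxas -v solver.cu
//   ptxas info : Used NN registers, ...   <- per-thread register count

// Ask the compiler to fit at least 256 threads per block
// (it may spill registers to local memory to achieve this).
__global__ void __launch_bounds__(256)
simulate(float *f_src, float *f_dst)   // hypothetical signature
{
    // kernel body
}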