3D thread blocks in a CUDA kernel

I want to use 3-dimensional thread blocks in one of my image processing applications.
The output image is 512x512x128.
Here is what I am trying to do:
blocks.x = 8, blocks.y = 8, blocks.z = 8
grids.x = 512/8, grids.y = 512/8, grids.z = 128/8.
The kernel in this case times out, as the total time it takes is more than a few seconds.
But when I comment out grids.z, it takes the default value (1) and the kernel runs fine.
To my surprise, the output is also correct.

Please let me know whether my understanding of 3D grids is correct or whether I am missing something.
I am using a GTX 660 (Kepler, compute capability 3.0) and CUDA 5.0.

What does your kernel code look like? It seems that although you define a 3D execution grid, you use only two dimensions.

MK
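
For illustration, a kernel that actually uses all three dimensions of the launch described in the question might look like the sketch below. The original kernel was not posted, so the kernel name fillVolume, its parameters, and the value it writes are hypothetical; only the 512x512x128 size and the 8x8x8 block come from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the poster's code): one thread per voxel.
__global__ void fillVolume(float *out, int width, int height, int depth)
{
    // Each index combines the block position and the thread position
    // along the same axis, including z.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // Guard against sizes that are not a multiple of the block size.
    if (x >= width || y >= height || z >= depth) return;

    size_t idx = ((size_t)z * height + y) * width + x;
    out[idx] = (float)z;  // placeholder for the real per-voxel work
}

int main()
{
    const int W = 512, H = 512, D = 128;  // sizes from the question

    float *d_out = NULL;
    if (cudaMalloc(&d_out, (size_t)W * H * D * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    dim3 block(8, 8, 8);
    dim3 grid((W + block.x - 1) / block.x,
              (H + block.y - 1) / block.y,
              (D + block.z - 1) / block.z);  // 64 x 64 x 16 blocks

    fillVolume<<<grid, block>>>(d_out, W, H, D);

    // Wait for the kernel and report whether it completed.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

If a kernel never reads blockIdx.z or threadIdx.z, every z layer of the grid repeats the same 2D work, which would explain both the longer run time with grids.z = 16 and the unchanged output with grids.z = 1.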

You are probably looking at results that are still around from the timed-out kernel launch. CUDA doesn’t clear device memory between runs.

Also, if your kernel is timing out (and really takes that long) and you’re on Windows, disable TDR (Timeout Detection and Recovery).
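
One way to tell fresh results from stale ones is to overwrite the output buffer with a sentinel before the launch and to check the error codes afterwards. The following is only a sketch: the kernel markDone, the buffer size, and the sentinel value 0xFF are made up for the example, while the runtime calls (cudaMemset, cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: marks every element as written.
__global__ void markDone(unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;
}

int main()
{
    const int n = 1 << 20;  // arbitrary size for the example
    unsigned char *d_out = NULL;
    if (cudaMalloc(&d_out, n) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // Fill the buffer with a sentinel so leftovers from an earlier
    // launch cannot be mistaken for new output.
    cudaMemset(d_out, 0xFF, n);

    markDone<<<(n + 255) / 256, 256>>>(d_out, n);

    // cudaGetLastError reports launch-time problems;
    // cudaDeviceSynchronize reports failures during execution,
    // such as a watchdog timeout.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    cudaError_t err = (launchErr != cudaSuccess) ? launchErr : syncErr;
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Copy one byte back: it holds 1 only if the kernel really ran.
    unsigned char first = 0;
    cudaMemcpy(&first, d_out, 1, cudaMemcpyDeviceToHost);
    printf("first element: %d\n", (int)first);

    cudaFree(d_out);
    return 0;
}
```

If the sentinel is still present after the launch, or the synchronize call returns an error, the "correct" output came from an earlier run.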

Thanks for all the replies.
I have converted the code to 2D thread blocks and now launch the kernel 128 times.
It seems that when I specify grids.z = 1 and blocks.z = 2,
I can reduce the number of calls from 128 to 64.
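
A rough sketch of that batched approach, assuming the kernel receives the starting z slice as an argument. The kernel name processSlab, the zOffset parameter, and the 8x8 block footprint are assumptions; only grids.z = 1, blocks.z = 2, and the 128 slices come from the posts above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: handles blockDim.z slices starting at zOffset.
__global__ void processSlab(float *out, int width, int height, int depth,
                            int zOffset)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = zOffset + threadIdx.z;  // grid.z is 1, so blockIdx.z is always 0

    if (x >= width || y >= height || z >= depth) return;

    out[((size_t)z * height + y) * width + x] = (float)z;  // placeholder work
}

int main()
{
    const int W = 512, H = 512, D = 128;

    float *d_out = NULL;
    if (cudaMalloc(&d_out, (size_t)W * H * D * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    dim3 block(8, 8, 2);  // two z slices per launch
    dim3 grid((W + block.x - 1) / block.x,
              (H + block.y - 1) / block.y,
              1);

    // Each launch covers block.z slices, so D / block.z launches are needed.
    int launches = 0;
    for (int z0 = 0; z0 < D; z0 += block.z) {
        processSlab<<<grid, block>>>(d_out, W, H, D, z0);
        ++launches;
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("%d launches, status: %s\n", launches, cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Splitting the work this way keeps each launch short, which also helps stay under the display watchdog limit.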

Thanks :)