I want to use a three-dimensional grid of threads in one of my image processing applications.

The output image is 512 x 512 x 128.

Here is what I am trying to do.

blocks.x = 8, blocks.y = 8, blocks.z = 8 (threads per block)

grids.x = 512/8, grids.y = 512/8, grids.z = 128/8 (blocks per grid)
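
A minimal sketch of the launch configuration described above, assuming one thread per output voxel and a float output buffer. The kernel name processVolume, its placeholder body, and the helper makeGrid are hypothetical and stand in for the real application code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per output voxel.
__global__ void processVolume(float *out, int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // Guard against sizes that are not a multiple of the block size.
    if (x >= width || y >= height || z >= depth) return;

    size_t idx = ((size_t)z * height + y) * width + x;
    out[idx] = 0.0f;  // placeholder for the real per-voxel work
}

// Hypothetical helper: number of blocks needed to cover the volume (rounds up).
dim3 makeGrid(int width, int height, int depth, dim3 blocks)
{
    return dim3((width  + blocks.x - 1) / blocks.x,
                (height + blocks.y - 1) / blocks.y,
                (depth  + blocks.z - 1) / blocks.z);
}

int main()
{
    const int width = 512, height = 512, depth = 128;
    dim3 blocks(8, 8, 8);                                 // 512 threads per block
    dim3 grids = makeGrid(width, height, depth, blocks);  // 64 x 64 x 16 blocks

    float *d_out = NULL;
    size_t bytes = (size_t)width * height * depth * sizeof(float);
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    processVolume<<<grids, blocks>>>(d_out, width, height, depth);

    // Wait for the kernel to finish and report any execution error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

With this indexing, each thread handles exactly one (x, y, z) voxel, so a grid with grids.z = 1 would only reach z values 0 through 7.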

In this case the kernel times out, because its total running time is more than a few seconds.
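
A minimal sketch of how the timeout could be confirmed, assuming device 0 is the GPU in question. The helper checkCuda is hypothetical; cudaErrorLaunchTimeout and the kernelExecTimeoutEnabled device property are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message if a CUDA call failed and
// points out the case where the run-time limit stopped the kernel.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    if (err == cudaErrorLaunchTimeout)
        fprintf(stderr, "The kernel exceeded the run-time limit on this device.\n");
    return false;
}

int main()
{
    // Report whether the device enforces a run-time limit on kernels
    // (typically the case when the GPU also drives a display).
    cudaDeviceProp prop;
    if (!checkCuda(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties"))
        return 1;
    printf("%s: run-time limit on kernels is %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");

    // ... launch the kernel here ...

    // Launch errors show up immediately, execution errors after synchronizing.
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;
    return 0;
}
```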

But when I comment out the grids.z assignment, it takes the default value (1) and the kernel runs fine.

To my surprise, the output is also correct.

Please let me know whether my understanding of 3-D grids is correct or whether I am missing something.

I am using a GTX 660 (Kepler, compute capability 3.0) and CUDA 5.0.