What is the equation for the unique block index of a three-dimensional grid?
The literature only ever uses two dimensions, but most of my kernels use a unique block index to access data, so that they scale up to the maximum the hardware allows (in theory, provided there is enough video RAM).
On compute capability 2.0 you have 6 indices to play with (3 for the block and 3 for the thread). Below that you have 5, because the grid is limited to two dimensions. There are many ways to map them onto your data.
For example, define grid(lz,ly,1) and threads(lx,1,1), and in the kernel you will have
ix=threadIdx.x;
iy=blockIdx.y;
iz=blockIdx.x;
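As a rough sketch of this first mapping, here are the three lines inside a complete kernel. The names scale3d and runScaleDemo, the x-fastest storage layout and the sizes are assumptions made for illustration only. With this mapping lx must not exceed the maximum number of threads per block (512 below compute capability 2.0, 1024 on 2.0), and ly and lz must each fit in one grid dimension.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: scales every element of an lx*ly*lz array.
// Assumed layout: x varies fastest, then y, then z.
__global__ void scale3d(float *data, int lx, int ly, float factor)
{
    int ix = threadIdx.x;  // 0 .. lx-1, one thread per x
    int iy = blockIdx.y;   // 0 .. ly-1
    int iz = blockIdx.x;   // 0 .. lz-1
    data[(iz * ly + iy) * lx + ix] *= factor;
}

// Hypothetical host wrapper: returns the number of wrong elements,
// or -1 if a CUDA call fails (for example when no device is present).
int runScaleDemo()
{
    const int lx = 64, ly = 8, lz = 4;  // illustrative sizes
    const int n = lx * ly * lz;
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(lz, ly, 1);     // blockIdx.x walks z, blockIdx.y walks y
    dim3 threads(lx, 1, 1);   // threadIdx.x walks x
    scale3d<<<grid, threads>>>(dev, lx, ly, 2.0f);
    cudaError_t err = cudaGetLastError();  // catches a bad launch configuration

    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f) ++bad;
    return bad;
}
```

No bounds check is needed here because the grid and block sizes match the array exactly.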
You can also launch with
threads(16,32,1) grid((lx+15)/16,(ly+31)/32,lz)
and in the kernel you will have
ix=blockIdx.x * blockDim.x + threadIdx.x;
iy=blockIdx.y * blockDim.y + threadIdx.y;
iz=blockIdx.z;
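A sketch of this second mapping follows, together with one common way to flatten the three block indices into a single unique block index (x fastest, then y, then z). The helper names flatIndex3, blocksFor, tagWithBlockIndex and runTagDemo are assumptions for illustration. The grid is rounded up with (n + blockSize - 1) / blockSize, so threads past the edge of the array have to be masked out, and a grid with more than one layer in z needs compute capability 2.0.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Flattens (x, y, z) inside an nx * ny * nz box, x fastest.
// With blockIdx and gridDim as arguments this gives a unique block index.
__host__ __device__ inline unsigned int flatIndex3(unsigned int x, unsigned int y,
                                                   unsigned int z,
                                                   unsigned int nx, unsigned int ny)
{
    return x + nx * (y + ny * z);
}

// Smallest number of blocks of size blockSize that covers n elements.
inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical kernel: writes the flat block index into every element.
__global__ void tagWithBlockIndex(unsigned int *out, int lx, int ly)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z;
    if (ix >= lx || iy >= ly) return;  // thread lies outside the array
    out[(iz * ly + iy) * lx + ix] =
        flatIndex3(blockIdx.x, blockIdx.y, blockIdx.z, gridDim.x, gridDim.y);
}

// Returns the number of mismatching elements, or -1 if a CUDA call fails.
int runTagDemo(int lx, int ly, int lz)
{
    const size_t n = (size_t)lx * ly * lz;
    unsigned int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return -1;

    dim3 threads(16, 32, 1);
    dim3 grid(blocksFor(lx, 16), blocksFor(ly, 32), lz);
    tagWithBlockIndex<<<grid, threads>>>(dev, lx, ly);
    cudaError_t err = cudaGetLastError();

    std::vector<unsigned int> host(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, n * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int iz = 0; iz < lz; ++iz)
        for (int iy = 0; iy < ly; ++iy)
            for (int ix = 0; ix < lx; ++ix) {
                unsigned int expected = flatIndex3(ix / 16, iy / 32, iz, grid.x, grid.y);
                if (host[((size_t)iz * ly + iy) * lx + ix] != expected) ++bad;
            }
    return bad;
}
```

If a unique thread index is needed as well, the same idea extends one level down: multiply the flat block index by the number of threads per block and add the thread's flat index within its block.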
It depends a little on your card (because of the maximum number of threads per block) and on the size of the matrix. So please tell us whether your card has compute capability 2.0, and the maximum or typical size of the matrix you will use.
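If you are not sure what your card supports, the runtime can report it. This is a minimal sketch using cudaGetDeviceProperties. The function name printLaunchLimits is an assumption for illustration, and device 0 is assumed to be the card in question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the launch limits of one device.
// Returns the major compute capability, or -1 if the query fails.
int printLaunchLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return prop.major;
}
```

Calling printLaunchLimits(0) and posting the output would answer both questions about the card. A third grid dimension larger than 1 means a three-dimensional grid is available.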