What is the equation for the unique block index of a three-dimensional grid?
All the literature only uses two dimensions, but most of my kernels use a unique block index for accessing data, so that they scale to the maximum the hardware allows (theoretically, if there were enough video RAM).
On compute capability 2.0 you have 6 indices to work with (three block indices and three thread indices). On lower compute capabilities you have 5, because the grid can only be two-dimensional. There are many ways to map them.
For example, define grid(lz,ly,1) and threads(lx,1,1); in the kernel you will then have
ix=threadIdx.x;
iy=blockIdx.y;
iz=blockIdx.x;
You can also launch with
threads(16,32,1) grid((lx+15)/16,(ly+31)/32,lz)
and in the kernel you will have
ix=blockIdx.x * blockDim.x + threadIdx.x;
iy=blockIdx.y * blockDim.y + threadIdx.y;
iz=blockIdx.z;
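The grid size in the launch above has to round up so that partially filled tiles are still covered, and the kernel then needs a bounds check. Here is a minimal sketch of both, assuming an lx-by-ly matrix; the names blocksFor and insideMatrix are hypothetical helpers, not part of CUDA, and the macro guard is only there so the same code builds with a plain C++ compiler as well as with nvcc.

```cpp
#include <cassert>

// Let the helpers compile both as plain C++ and inside a .cu file.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: number of blocks of size blockSize needed to
// cover n elements. This is integer ceiling division, so it rounds up.
__host__ __device__ inline unsigned blocksFor(unsigned n, unsigned blockSize)
{
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical guard for the kernel body: threads in the last, partially
// filled block along x or y fall outside the lx-by-ly matrix and must
// not touch memory.
__host__ __device__ inline bool insideMatrix(unsigned ix, unsigned iy,
                                             unsigned lx, unsigned ly)
{
    return ix < lx && iy < ly;
}
```

With lx = 33 and 16 threads along x, blocksFor(33, 16) gives 3 blocks, so 48 threads are launched along x and the last 15 are filtered out by insideMatrix(ix, iy, lx, ly) at the top of the kernel.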
Which mapping is best depends a little on your card (because of the maximum number of threads per block) and on the size of the matrix. So please tell us whether your card has compute capability 2.0 or not, and the maximum or typical size of the matrix you will use.
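As for the equation asked about in the question: one common choice, though not the only one, is a row-major linearization in which x varies fastest, then y, then z. The sketch below assumes that ordering. Idx3 is a hypothetical stand-in for CUDA's dim3/uint3 so the code builds without nvcc, and flatBlockIndex and flatThreadIndex are illustrative names.

```cpp
#include <cassert>

// Let the helpers compile both as plain C++ and inside a .cu file.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for CUDA's dim3 / uint3.
struct Idx3 { unsigned x, y, z; };

// Unique block index in a 3D grid (row-major: x fastest, z slowest).
// bidx plays the role of blockIdx, gdim the role of gridDim.
__host__ __device__ inline unsigned long long flatBlockIndex(Idx3 bidx, Idx3 gdim)
{
    assert(bidx.x < gdim.x && bidx.y < gdim.y && bidx.z < gdim.z);
    return bidx.x
         + static_cast<unsigned long long>(gdim.x)
           * (bidx.y + static_cast<unsigned long long>(gdim.y) * bidx.z);
}

// Unique global thread index: all threads of block 0 first, then block 1, ...
// tidx plays the role of threadIdx, bdim the role of blockDim.
__host__ __device__ inline unsigned long long flatThreadIndex(Idx3 tidx, Idx3 bdim,
                                                              Idx3 bidx, Idx3 gdim)
{
    unsigned long long threadsPerBlock =
        static_cast<unsigned long long>(bdim.x) * bdim.y * bdim.z;
    unsigned long long inBlock =
        tidx.x + bdim.x * (tidx.y + static_cast<unsigned long long>(bdim.y) * tidx.z);
    return flatBlockIndex(bidx, gdim) * threadsPerBlock + inBlock;
}
```

Inside a kernel the same arithmetic reads blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z). The 64-bit result is a deliberate choice here, since the product of three grid dimensions can exceed what fits in 32 bits.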
As not all threads can be active concurrently anyway, you can also put loops inside your kernels or around your kernel calls.
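A loop inside the kernel is often written as a grid-stride loop: each thread starts at its own unique index and advances by the total number of launched threads. Below is a small host-side emulation of that pattern, only meant as a sketch. scaleStrided and emulateLaunch are hypothetical names, and the scaling operation is just a placeholder for real work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Let the helper compile both as plain C++ and inside a .cu file.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Work done by one logical thread. In a kernel, threadId would be
// blockIdx.x * blockDim.x + threadIdx.x and totalThreads would be
// gridDim.x * blockDim.x.
__host__ __device__ inline void scaleStrided(float* data, std::size_t n, float factor,
                                             std::size_t threadId, std::size_t totalThreads)
{
    assert(totalThreads > 0);
    for (std::size_t i = threadId; i < n; i += totalThreads)
        data[i] *= factor;   // every element is visited by exactly one thread
}

// Host-only emulation: run the logical threads one after another.
inline void emulateLaunch(std::vector<float>& data, float factor, std::size_t totalThreads)
{
    for (std::size_t t = 0; t < totalThreads; ++t)
        scaleStrided(data.data(), data.size(), factor, t, totalThreads);
}
```

The same body works whether there are fewer threads than elements (each thread loops several times) or more (the surplus threads do nothing), so the launch size no longer has to match the data size.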
A unique thread index can also be used for indexing memory, but there is more than one way to create a unique index out of multidimensional indices.
It makes sense to consider memory access patterns (coalescing, L1/L2 cache hit rate, freedom from shared memory bank conflicts) to create one (or several) unique id(s) optimized for your kernel.
int unique1 = blockIdx.x * blockDim.x + threadIdx.x;
int unique2 = threadIdx.x * gridDim.x + blockIdx.x;
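To see how the two variants differ, here is a small host-side sketch for a one-dimensional launch. It checks that both assign every index in [0, gridDim.x * blockDim.x) exactly once, and it makes the distance between neighbouring threads visible. The parameter names bidx, bdim, gdim and tidx stand for blockIdx.x, blockDim.x, gridDim.x and threadIdx.x; coversEachIndexOnce is a hypothetical checker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block-major: neighbouring threads of one block get neighbouring indices.
inline unsigned unique1(unsigned bidx, unsigned bdim, unsigned tidx)
{
    return bidx * bdim + tidx;
}

// Thread-major: neighbouring threads of one block are gdim elements apart.
inline unsigned unique2(unsigned bidx, unsigned gdim, unsigned tidx)
{
    return tidx * gdim + bidx;
}

// Hypothetical checker: true if the chosen formula hits every index in
// [0, gdim * bdim) exactly once.
inline bool coversEachIndexOnce(unsigned gdim, unsigned bdim, bool useUnique2)
{
    assert(gdim > 0 && bdim > 0);
    std::vector<int> hits(static_cast<std::size_t>(gdim) * bdim, 0);
    for (unsigned b = 0; b < gdim; ++b) {
        for (unsigned t = 0; t < bdim; ++t) {
            unsigned idx = useUnique2 ? unique2(b, gdim, t) : unique1(b, bdim, t);
            if (idx >= hits.size()) return false;
            ++hits[idx];
        }
    }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

Both formulas are valid unique ids. When used directly as a global memory index, unique1 lets the threads of a warp read adjacent elements, which is the pattern that coalesces, while unique2 spreads them gridDim.x elements apart.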