I am trying to launch a kernel with 3D threads. I have an array of N elements, where each element is a 2D array, and I want to process it with CUDA. If I launch a 2D kernel, I need a loop of N iterations to process each element of the array. I want to eliminate that loop by using 3D threads.

2D threads are launched and indexed in the following manner:
Launching:
```
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(cuiDivUp(sizex, dimBlock.x), cuiDivUp(sizey, dimBlock.y), 1);
Kernel<<< dimGrid, dimBlock >>>();
```

Indexing:
```
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
```

Since grids can only be two-dimensional, I cannot do the following:
```
dim3 dimBlock(8, 8, 4);
dim3 dimGrid(cuiDivUp(sizex, dimBlock.x), cuiDivUp(sizey, dimBlock.y), cuiDivUp(sizez, dimBlock.z));
Kernel<<< dimGrid, dimBlock >>>();
```

What is the right procedure for indexing 3D threads?

I don’t know that there is a “right” procedure. You can convert any arbitrary thread-index/block-index combination into a “global” index. For your 2D case it looks like this:

```
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
// The grid is blockDim.x * gridDim.x threads wide, so that is the row stride:
unsigned int tidx = x + y * blockDim.x * gridDim.x;
```

tidx is the “global” index of the thread on the grid. You can then convert that “global” index into rectilinear coordinates of whatever dimensionality you want. For example, in column-major order for a 3D grid of dimensions (dX, dY, dZ), in pseudocode:

```
i = tidx mod dX
j = ((tidx - i) / dX) mod dY
k = (((tidx - i) / dX) - j) / dY
```

returns the equivalent coordinates (i,j,k). Similar calculations can be derived for any dimensionality you like, in either column major or row major order.

I use the following code, found elsewhere in this forum.

Kernel launch

```
dim3 dimGrid = dim3(blocksInX, blocksInY*blocksInZ);
```