I am trying to launch a kernel with 3D thread indexing. I have an array of N elements, where each element is a 2D array, and I want to process it with CUDA. If I launch a 2D kernel, each thread needs a for loop of count N to process every element of the array. I want to eliminate that for loop by using 3D threads.
With a 2D launch, threads are launched and indexed as follows:
Launching:
[codebox]
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(cuiDivUp(sizex, dimBlock.x), cuiDivUp(sizey, dimBlock.y), 1);
Kernel<<< dimGrid, dimBlock >>>();
[/codebox]
Indexing:
[codebox]
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
[/codebox]
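For reference, the 2D approach with the loop over N can be sketched roughly as below. This is a host-side emulation in plain C++ rather than a real kernel, so the index arithmetic can be checked without a GPU. The names `processColumn` and `emulateLaunch2D`, the contiguous slice-by-slice memory layout, and the increment standing in for the real work are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allow the per-thread body to compile with nvcc or a plain C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Work done by one thread of the 2D launch (hypothetical name).
// In a real kernel, x and y come from blockIdx * blockDim + threadIdx.
__host__ __device__ inline void processColumn(float* data,
                                              unsigned int x, unsigned int y,
                                              unsigned int sizex, unsigned int sizey,
                                              unsigned int n)
{
    if (x >= sizex || y >= sizey) return;      // padding threads do nothing
    for (unsigned int k = 0; k < n; ++k) {     // the for loop of count N
        // Assumed layout: element k is a contiguous sizex * sizey slice.
        std::size_t i = (static_cast<std::size_t>(k) * sizey + y) * sizex + x;
        data[i] += 1.0f;                       // placeholder for the real work
    }
}

// Host-side emulation of Kernel<<<dimGrid, dimBlock>>> with dimBlock(16, 16, 1).
inline void emulateLaunch2D(std::vector<float>& data,
                            unsigned int sizex, unsigned int sizey, unsigned int n)
{
    assert(data.size() == static_cast<std::size_t>(sizex) * sizey * n);
    const unsigned int bdx = 16, bdy = 16;                 // blockDim.x, blockDim.y
    const unsigned int gridX = (sizex + bdx - 1) / bdx;    // round-up division
    const unsigned int gridY = (sizey + bdy - 1) / bdy;
    for (unsigned int by = 0; by < gridY; ++by)
        for (unsigned int bx = 0; bx < gridX; ++bx)
            for (unsigned int ty = 0; ty < bdy; ++ty)
                for (unsigned int tx = 0; tx < bdx; ++tx)
                    processColumn(data.data(), bx * bdx + tx, by * bdy + ty,
                                  sizex, sizey, n);
}
```

Calling `emulateLaunch2D` on a zero-filled vector of `sizex * sizey * n` floats should leave every entry at 1, since each (x, y) thread walks all N slices once.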
Since a grid of blocks can only be 2D, I cannot do the following:
[codebox]
dim3 dimBlock(8, 8, 4);
dim3 dimGrid(cuiDivUp(sizex, dimBlock.x), cuiDivUp(sizey, dimBlock.y), cuiDivUp(sizez, dimBlock.z));
Kernel<<< dimGrid, dimBlock >>>();
[/codebox]
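One possible workaround, sketched here under assumptions and not verified on a device, is to keep the 3D block but fold the z blocks into the grid's y dimension, then split `blockIdx.y` back apart inside the kernel. The code below emulates this on the host in plain C++ to check that the mapping covers every element exactly once. `Index3`, `divUp`, `unfoldIndex`, and `coversVolumeOnce` are hypothetical names, and `divUp` is only assumed to behave like `cuiDivUp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allow the helpers to compile with nvcc (callable from a kernel) or plain C++.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper type for this sketch; not part of CUDA.
struct Index3 { unsigned int x, y, z; };

// Round-up integer division, assumed to match what cuiDivUp does.
__host__ __device__ inline unsigned int divUp(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Recover a 3D element index when the z blocks are folded into the grid's y
// dimension. In a kernel the arguments would be blockIdx.x, blockIdx.y,
// threadIdx.{x,y,z}, blockDim.{x,y,z} and blocksY = divUp(sizey, blockDim.y).
__host__ __device__ inline Index3 unfoldIndex(
    unsigned int bx, unsigned int byFolded,
    unsigned int tx, unsigned int ty, unsigned int tz,
    unsigned int bdx, unsigned int bdy, unsigned int bdz,
    unsigned int blocksY)
{
    const unsigned int by = byFolded % blocksY;   // real block row
    const unsigned int bz = byFolded / blocksY;   // block slice along z
    Index3 idx;
    idx.x = bx * bdx + tx;
    idx.y = by * bdy + ty;
    idx.z = bz * bdz + tz;
    return idx;
}

// Host-side emulation of the launch: returns true if every element of a
// sizex * sizey * sizez volume is visited exactly once.
inline bool coversVolumeOnce(unsigned int sizex, unsigned int sizey, unsigned int sizez)
{
    assert(sizex > 0 && sizey > 0 && sizez > 0);
    const unsigned int bdx = 8, bdy = 8, bdz = 4;            // dimBlock(8, 8, 4)
    const unsigned int blocksY = divUp(sizey, bdy);
    const unsigned int gridX = divUp(sizex, bdx);            // dimGrid.x
    const unsigned int gridY = blocksY * divUp(sizez, bdz);  // dimGrid.y with z folded in
    std::vector<int> visits(static_cast<std::size_t>(sizex) * sizey * sizez, 0);
    for (unsigned int by = 0; by < gridY; ++by)
        for (unsigned int bx = 0; bx < gridX; ++bx)
            for (unsigned int tz = 0; tz < bdz; ++tz)
                for (unsigned int ty = 0; ty < bdy; ++ty)
                    for (unsigned int tx = 0; tx < bdx; ++tx) {
                        const Index3 i = unfoldIndex(bx, by, tx, ty, tz,
                                                     bdx, bdy, bdz, blocksY);
                        // Bounds check that a kernel would need as well.
                        if (i.x >= sizex || i.y >= sizey || i.z >= sizez) continue;
                        ++visits[(static_cast<std::size_t>(i.z) * sizey + i.y) * sizex + i.x];
                    }
    for (std::size_t k = 0; k < visits.size(); ++k)
        if (visits[k] != 1) return false;
    return true;
}
```

With this scheme the launch would use `dimGrid(gridX, gridY, 1)`, so the product of the y and z block counts still has to fit within the device's maximum grid dimension.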
What is the right procedure for indexing 3D threads?