I’m trying to learn CUDA and am relatively new to parallell programming.

I am trying to create a kernel that uses convolution on two 3d matrices. I got a working program which I will paste below just two give a basic idea of what I’m trying to do (it’s not really necessary to go through the code, but will paste it in case it’s gives a better explanation of what I’m trying to do).

As I’m kind of new to the parallell thinking modell I think my approach might be a little wrong and am turning here to get some pointers on how you should think when dividing the grid and blocks.

In my current solution I only divide the grid in x,y coordinates (as there is no z) and I also only use x and y when applying threads to the blocks and then for every thread run through every z element. The program will be handling really big matrices which of course can be separated in to small matrices but at the moment I just feel like I have to get in to the general thinking of dividing grids and block for 3d matrices.

Is there any one who knows of a better way to divide the grids and blocks when dealing with 3d matrices? On a general note the elements in the matrix is put in one long array.

**Short summary of the code**

```
Matrix A, B, C;
//Calculate block and grid size
int threadsize = 16;
int gridsizex = A.width/threadsize;
int gridsizey = A.height/threadsize;
//Define the grid- and block size
dim3 grid(gridsizex,gridsizey,1);
dim3 threads(threadsize, threadsize, 1);
//Execute kernel
MatBlurKernel<<<grid, threads >>>(A, B, C);
```

```
__global__ void
MatBlurKernel(Matrix A, Matrix B, Matrix C)
{
//To calculate the curent index in the array for the (x,y,z) element
unsigned int index = 0;
//Calculate x and y index
const unsigned int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;
const unsigned int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;
// To calculate which indexes we will be using in matrix 2
const int iMax = (int)( B.width * 0.5 );
const int jMax = (int)( B.height * 0.5 );
const int kMax = (int)( B.depth * 0.5 );
// Loop through every z value of the current x and y value of the matrice.
for ( int z = 0; z < A.depth; z++)
{
float pValue = 0.0f;
//Calculate the index for the elements
index = ( z * A.width * A.height ) + ( x * A.width ) + ( y );
//Apply convolution
for( int i = -iMax; i <= iMax; i++ )
{
for( int j = -jMax; j <= jMax; j++ )
{
for ( int k = -kMax; k <= kMax; k++)
{
//Don't add elements that are outside of matrix 1
if( k + z >= 0 && k + z < A.depth &&
i + y >= 0 && i + y < A.width &&
j + x >= 0 && j + x < A.height )
{
float Aelement = A.elements[ index ];
float Belement = B.elements[ (i + iMax) + (j + jMax) * B.width + (k + kMax) * B.width * B.height ];
pValue += Aelement * Belement;
}
}
}
}
C.elements[index] = pValue;
}
}
```