 # Parallell thinking for 3d convolution

I’m trying to learn CUDA and am relatively new to parallell programming.

I am trying to create a kernel that uses convolution on two 3d matrices. I got a working program which I will paste below just two give a basic idea of what I’m trying to do (it’s not really necessary to go through the code, but will paste it in case it’s gives a better explanation of what I’m trying to do).

As I’m kind of new to the parallell thinking modell I think my approach might be a little wrong and am turning here to get some pointers on how you should think when dividing the grid and blocks.

In my current solution I only divide the grid in x,y coordinates (as there is no z) and I also only use x and y when applying threads to the blocks and then for every thread run through every z element. The program will be handling really big matrices which of course can be separated in to small matrices but at the moment I just feel like I have to get in to the general thinking of dividing grids and block for 3d matrices.

Is there any one who knows of a better way to divide the grids and blocks when dealing with 3d matrices? On a general note the elements in the matrix is put in one long array.

Short summary of the code

``````Matrix A, B, C;

//Calculate block and grid size

//Define the grid- and block size

dim3 grid(gridsizex,gridsizey,1);

//Execute kernel

``````
``````__global__ void

MatBlurKernel(Matrix A, Matrix B, Matrix C)

{

//To calculate the curent index in the array for the (x,y,z) element

unsigned int index = 0;

//Calculate x and y index

const unsigned int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;

const unsigned int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;

// To calculate which indexes we will be using in matrix 2

const int iMax = (int)( B.width * 0.5 );

const int jMax = (int)( B.height * 0.5 );

const int kMax = (int)( B.depth * 0.5 );

// Loop through every z value of the current x and y value of the matrice.

for ( int z = 0; z < A.depth; z++)

{

float pValue = 0.0f;

//Calculate the index for the elements

index = ( z * A.width * A.height ) + ( x * A.width ) + ( y );

//Apply convolution

for( int i = -iMax; i <= iMax; i++ )

{

for( int j = -jMax; j <= jMax; j++ )

{

for ( int k = -kMax; k <= kMax; k++)

{

//Don't add elements that are outside of matrix 1

if( k + z >= 0 && k + z < A.depth &&

i + y >= 0  && i + y < A.width   &&

j + x >= 0  && j + x < A.height )

{

float Aelement = A.elements[ index ];

float Belement = B.elements[ (i + iMax) + (j + jMax) * B.width + (k + kMax) * B.width * B.height ];

pValue += Aelement * Belement;

}

}

}

}

C.elements[index] = pValue;

}

}
``````