Parallell thinking for 3d convolution

I’m trying to learn CUDA and am relatively new to parallell programming.

I am trying to create a kernel that uses convolution on two 3d matrices. I got a working program which I will paste below just two give a basic idea of what I’m trying to do (it’s not really necessary to go through the code, but will paste it in case it’s gives a better explanation of what I’m trying to do).

As I’m kind of new to the parallell thinking modell I think my approach might be a little wrong and am turning here to get some pointers on how you should think when dividing the grid and blocks.

In my current solution I only divide the grid in x,y coordinates (as there is no z) and I also only use x and y when applying threads to the blocks and then for every thread run through every z element. The program will be handling really big matrices which of course can be separated in to small matrices but at the moment I just feel like I have to get in to the general thinking of dividing grids and block for 3d matrices.

Is there any one who knows of a better way to divide the grids and blocks when dealing with 3d matrices? On a general note the elements in the matrix is put in one long array.

Short summary of the code

Matrix A, B, C;

//Calculate block and grid size

int threadsize = 16;

int gridsizex = A.width/threadsize;

int gridsizey = A.height/threadsize;


//Define the grid- and block size

dim3 grid(gridsizex,gridsizey,1);

dim3 threads(threadsize, threadsize, 1);

//Execute kernel

MatBlurKernel<<<grid, threads >>>(A, B, C);
__global__ void

MatBlurKernel(Matrix A, Matrix B, Matrix C) 


        //To calculate the curent index in the array for the (x,y,z) element

	unsigned int index = 0;	


        //Calculate x and y index

  const unsigned int y = ( blockIdx.y * blockDim.y ) + threadIdx.y;

	const unsigned int x = ( blockIdx.x * blockDim.x ) + threadIdx.x;


        // To calculate which indexes we will be using in matrix 2

	const int iMax = (int)( B.width * 0.5 );

	const int jMax = (int)( B.height * 0.5 );

	const int kMax = (int)( B.depth * 0.5 );


	// Loop through every z value of the current x and y value of the matrice.

	for ( int z = 0; z < A.depth; z++)


   float pValue = 0.0f;


  //Calculate the index for the elements

  index = ( z * A.width * A.height ) + ( x * A.width ) + ( y );


  //Apply convolution

  for( int i = -iMax; i <= iMax; i++ )


  	for( int j = -jMax; j <= jMax; j++ )


    for ( int k = -kMax; k <= kMax; k++)


                                        //Don't add elements that are outside of matrix 1

    	if( k + z >= 0 && k + z < A.depth &&

         i + y >= 0  && i + y < A.width   &&

         j + x >= 0  && j + x < A.height )


      float Aelement = A.elements[ index ];

      float Belement = B.elements[ (i + iMax) + (j + jMax) * B.width + (k + kMax) * B.width * B.height ];

      pValue += Aelement * Belement;





  C.elements[index] = pValue;