__syncthreads() in loop: why does this work?

Hi -

I’ve got some code that wasn’t working with certain grid sizes until I put __syncthreads() in. I’m not using any shared memory and only access global memory in the kernel, so originally I didn’t think I needed a __syncthreads(). For my own edification, I’m just wondering why adding __syncthreads() makes it work here.

The code worked for grid sizes that were even multiples of 16. Odd multiples did not produce the correct result.

Threadblock size is: 128x1

compute_T() below only reads from d_pData and just performs some numeric calculations.

Simplified version of the code follows:

__device__ __constant__ float d_Grid[9];
__device__ __constant__ int d_DimI[3];

__global__
void myKernel ( float *d_pResult, float *d_pData, float *d_u1, float *d_u2, float *d_u3 )
{
    int kk = threadIdx.x + blockIdx.x*blockDim.x;
    int jj = threadIdx.y + blockIdx.y*blockDim.y;

    // foreach slice
    for( int ii=1;ii<d_DimI[X]-1;ii++ )
    {
        // foreach voxel in interior of grid
        if( ii > 0 && ii < d_DimI[X]-1 &&
            jj > 0 && jj < d_DimI[Y]-1 &&
            kk > 0 && kk < d_DimI[Z]-1 )
        {
            // compute world space location
            float wx = ii*d_Grid[ GRID_DX ] - g_u1( ii,jj,kk );
            float wy = jj*d_Grid[ GRID_DY ] - g_u2( ii,jj,kk );
            float wz = kk*d_Grid[ GRID_DZ ] - g_u3( ii,jj,kk );

            g_Result( ii,jj,kk ) = compute_T( d_pData, wx, wy, wz );
        }

        __syncthreads();
    }
}

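This is just a suggestion for a performance improvement:

Your check on jj and kk inside the for loop is redundant! jj and kk do not change inside the loop, so why test them again on every iteration? (The ii part of the test is redundant too, since the loop bounds already guarantee it.)

Btw, if you move the check out so that it wraps the whole loop, your __syncthreads() can hang, because some threads will never enter the loop and so never reach the barrier. You have to be careful here and decide: either keep the barrier where every thread of the block reaches it, or drop it.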
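
For what it's worth, here is a minimal sketch of one way to hoist the jj/kk test without putting __syncthreads() on a divergent path. It is not the original kernel: `hoistedKernel`, `d_Dim`, `run_hoisted_demo` and the flat indexing are made-up names for illustration, and a plain store of 1.0f stands in for compute_T(). The idea is to evaluate the loop-invariant test once into a flag and to let every thread run the loop, so every thread in the block still reaches the barrier.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the poster's d_DimI: grid extent in X, Y, Z.
__constant__ int d_Dim[3];

__global__ void hoistedKernel(float *result)
{
    int kk = threadIdx.x + blockIdx.x * blockDim.x;
    int jj = threadIdx.y + blockIdx.y * blockDim.y;

    // jj and kk never change inside the loop, so test them once.
    const bool interior = jj > 0 && jj < d_Dim[1] - 1 &&
                          kk > 0 && kk < d_Dim[2] - 1;

    // No early return and no "if" around the loop: the trip count depends
    // only on d_Dim[0], so all threads of a block reach the barrier together.
    for (int ii = 1; ii < d_Dim[0] - 1; ++ii) {
        if (interior) {
            // Placeholder for the real per-voxel computation.
            result[(ii * d_Dim[1] + jj) * d_Dim[2] + kk] = 1.0f;
        }
        __syncthreads();
    }
}

// Launches the kernel with 128x1 blocks and returns the number of voxels
// whose value differs from what is expected (-1 on a CUDA error).
int run_hoisted_demo(int nx, int ny, int nz)
{
    const int dim[3] = { nx, ny, nz };
    const size_t n = static_cast<size_t>(nx) * ny * nz;
    float *d_result = nullptr;

    if (cudaMemcpyToSymbol(d_Dim, dim, sizeof(dim)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_result, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMemset(d_result, 0, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_result);
        return -1;
    }

    // Round the grid up so sizes that are not a multiple of the block
    // dimensions are still fully covered.
    dim3 block(128, 1);
    dim3 grid((nz + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    hoistedKernel<<<grid, block>>>(d_result);

    std::vector<float> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_result, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            for (int k = 0; k < nz; ++k) {
                const bool in = i > 0 && i < nx - 1 &&
                                j > 0 && j < ny - 1 &&
                                k > 0 && k < nz - 1;
                const float expected = in ? 1.0f : 0.0f;
                if (h[(static_cast<size_t>(i) * ny + j) * nz + k] != expected)
                    ++mismatches;
            }
    return mismatches;
}
```

On a machine with a CUDA device, run_hoisted_demo(17, 24, 40) should report no mismatches. If the kernel really never shares data between threads, the other option is simply to drop the barrier; this sketch only shows that hoisting the test and keeping the barrier are compatible as long as no thread skips the loop.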