problem with '__syncthreads()'

Hi,

I’ve got problem with problem with the ‘__syncthreads()’ function. I think, I do not understand what exactly happens when I call it. I’ve noticed, that the next line after call of that function doesn’t work correctly.

This is simple kernel code, which applies 1D blur filter along the one axis of 3d Image:

__global__ void ex_blur_X_kenrel( float * d_input, float * d_output, float * dKernelData, int nKernelRad, uint nW, uint nH, uint nD, uint y )

{

	uint x, z;

	float fKVal = 0.0f;

	float fVal;

	

	__shared__ float a[ 128 ][ 2 ];

	x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

	z = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

	if ((x >= nW) || (y >= nH) || (z >= nD) ) return; 

	uint idx = z * nW * nH + y * nW + x;

//read data into shared memory

	a[ threadIdx.x ][ threadIdx.y ] = d_input[ idx ];    

	__syncthreads();

//apply kernel:

	fVal  = a[ threadIdx.x ][ threadIdx.y ] * dKernelData[ 0 ];

	for ( int ix = 1 ; ix < nKernelRad; ix++ )

	{

		fKVal = dKernelData[ ix ];

		if( ( threadIdx.x  + ix ) < blockDim.x ) 

		{

			fVal += a[ threadIdx.x  + ix ][ threadIdx.y ] * fKVal;

		}

		if( ( (int)threadIdx.x - ix ) >= 0 ) 

		{

			fVal += a[ threadIdx.x  - ix ][ threadIdx.y ] * fKVal;

		}

	}

	d_output[ idx ] = fVal;

}

unfortunately in this case, the line:

fVal  = a[ threadIdx.x ][ threadIdx.y ] * dKernelData[ 0 ];

doesn’t work.

When I removed that line, in most cases everything was ok (sometimes there are artefacts causes by the lack of synchronization of course ).

You can find results in attachments.

I’ll be grateful for explanation and your ideas. I use cuda 4.0 and gf480.

Best regards,

Jakub
incorrect_with_syncthreads.png
correctCPU.png

I think this is because

if ((x >= nW) || (y >= nH) || (z >= nD) ) return;

some threads return and then behaviour of __syncthreads() is undefined.

try the following variant

__global__ void ex_blur_X_kenrel( float * d_input, float * d_output, float * dKernelData, int nKernelRad, uint nW, uint nH, uint nD, uint y )

{

    uint x, z;

    float fKVal = 0.0f;

    float fVal;

    __shared__ float a[ 128 ][ 2 ];

x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    z = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

bool invalid = ((x >= nW) || (y >= nH) || (z >= nD)) ;

//if ((x >= nW) || (y >= nH) || (z >= nD) ) return; 

uint idx = z * nW * nH + y * nW + x;

//read data into shared memory

    float val = 0.0f ;

    if (!invalid ){

        val = d_input[ idx ];

    }

    a[ threadIdx.x ][ threadIdx.y ] = val;

    __syncthreads();

//apply kernel:

    fVal  = a[ threadIdx.x ][ threadIdx.y ] * dKernelData[ 0 ];

for ( int ix = 1 ; ix < nKernelRad; ix++ )

    {

        fKVal = dKernelData[ ix ];

        if( ( threadIdx.x  + ix ) < blockDim.x ) 

        {

            fVal += a[ threadIdx.x  + ix ][ threadIdx.y ] * fKVal;

        }

        if( ( (int)threadIdx.x - ix ) >= 0 ) 

        {

            fVal += a[ threadIdx.x  - ix ][ threadIdx.y ] * fKVal;

        }

    }

    if (!invalid){

        d_output[ idx ] = fVal;

    }

}

Thank you LSChien for pointing me the problem with threads return and __syncthreads().
Unfortunately it did not solve the problem I described.
best regards,
Jakub