Loop execution inside the kernel vs. outside of it

Hi,

I am facing a very weird problem:

I wrote an image processing piece of code that requires a few dozen iterations. The original WORKING code is designed so that the kernel executes one iteration at a time, while the loop runs in the host part of the code. To speed it up, I tried to move the loop into the kernel.

The problem is that I get different results, although the code is supposed to do the same thing.

This is the modified code:

__global__ void
AnisoTex( float* g_odata, int width, int height, int NumOfFrames, enum KappaCalcFunc mode,
          float lambda, float LinearKappaVal, float minKappa, float maxKappa, int normalizeVal, int iters)
{
    float kappa;
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    //for(int i=0; i<iters; i++){
    kappa = LinearKappaVal;

    // read the centre pixel and its 4 neighbours (0 outside the image)
    float f0p01 = (x>=0 && x<width && y-1>=0 && y-1<height) ? g_odata[(y-1)*width+x] : 0;
    float f0p10 = (x-1>=0 && x-1<width && y>=0 && y<height) ? g_odata[y*width+x-1]   : 0;
    float f0p11 = g_odata[y*width+x];
    float f0p12 = (x+1>=0 && x+1<width && y>=0 && y<height) ? g_odata[y*width+x+1]   : 0;
    float f0p21 = (x>=0 && x<width && y+1>=0 && y+1<height) ? g_odata[(y+1)*width+x] : 0;

    float res = AnisoTropicDiff(f0p01, f0p10, f0p11, f0p12, f0p21, kappa, lambda, normalizeVal);

    __syncthreads();
    g_odata[y*width + x] = res;
    //    __syncthreads();
    //}
}   // end of AnisoTex()

In this code we read the pixels from an array that resides on the device and manipulate the pixels in that same array.

Note that the inner loop in the above code is commented out.

The way the kernel is called is:

void executeKernel() {
    dim3 dimBlock(16, 12, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // execute the kernel
    for (int i = 0; i < iterNum; i++) {
        AnisoTex<<< dimGrid, dimBlock, 0 >>>( d_data_f0, width, height, numOfFrames, kappaMode, m_lambda,
                                              m_LinearKappaVal, m_minKappaVal, m_maxKappaVal, m_normalizeVal,
                                              iterNum);
        CUDA_SAFE_CALL( cudaThreadSynchronize() );

        // check if kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");
    }

    printf("Execution of kernel ended\n");
}   // end of executeKernel()

Calling the kernel this way produces the expected results. If I uncomment the inner loop in the kernel (and the last __syncthreads()), and comment out the loop in the host function above, I get strange results.

Attached are examples of both results. Although the difference seems small, it accumulates as more iterations are added and hurts the final result.

This is the bad result (iterations inside the kernel):

[attachment=5245:attachment]

This is the good result (iterations outside the kernel):

[attachment=5246:attachment]

If anyone has a clue I’ll be grateful, since I have no idea what the reason for this bizarre problem is.

The __syncthreads() in your kernel code only synchronizes the threads within one block, while calling the kernel in a loop makes sure all blocks have been run before starting the next iteration.

So I guess that pixels in one block depend on values of pixels in another block (I did not check that in your code, but the wrong result does look blocky).

That is the reason why you get different results.
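To make that concrete, here is a toy sketch (a hypothetical kernel, not taken from your code) of why __syncthreads() cannot order accesses that cross block boundaries, while a separate kernel launch per iteration can:

// Hypothetical illustration: block b writes data[b] and then reads data[b-1],
// which is written by a *different* block. __syncthreads() is a barrier for
// this block only, so the read may see either the old or the new value of
// data[b-1]. A new kernel launch per iteration acts as a grid-wide barrier
// and removes this race.
__global__ void crossBlockRace(int* data)
{
    int b = blockIdx.x;
    if (threadIdx.x == 0)
        data[b] = b + 1;           // this block's write
    __syncthreads();               // synchronizes this block only
    if (threadIdx.x == 0 && b > 0)
        data[b] += data[b - 1];    // neighbouring block's value: old or new?
}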

Thanks! You’re right that some pixels are dependent on pixels from other blocks.

I thought that __syncthreads() would do the trick and indeed forgot that it only affects one block.

So is there a way to synchronize all threads and not only threads from the same block?

Thanks again!

No, you can only synchronize the threads within one block. For example, a card with only 6 multiprocessors that each run 3 blocks can have 3*6 = 18 blocks in flight concurrently. So if you need more than 18 blocks there is no way to synchronize them all, since the later blocks cannot run until the first blocks are done.

So I am afraid that you can only:

- loop inside the kernel over the whole image (and thus run only 1 block), iterating numIter times (see the sketch below)

- call the kernel in a loop like you did (and thus pay the overhead of launching the kernel numIter times)

And the second option will most likely always be the fastest.
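For completeness, the first option could look roughly like the sketch below. This is only a sketch: AnisoTexSingleBlock is a hypothetical name, it reuses your AnisoTropicDiff device function, drops the parameters not used in the snippet you posted, and does the ping-pong buffering inside the kernel. It must be launched with a single block, e.g. AnisoTexSingleBlock<<< 1, dimBlock >>>(...), so that __syncthreads() really is a barrier for all running threads.

__global__ void
AnisoTexSingleBlock(float* bufA, float* bufB, int width, int height,
                    float lambda, float kappa, int normalizeVal, int iters)
{
    float* in  = bufA;   // read buffer
    float* out = bufB;   // write buffer

    for (int it = 0; it < iters; it++) {
        // one block covers the whole image: each thread strides over the pixels
        for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
             idx < width * height;
             idx += blockDim.x * blockDim.y) {
            int x = idx % width;
            int y = idx / width;
            float up    = (y > 0)          ? in[(y-1)*width + x] : 0;
            float left  = (x > 0)          ? in[y*width + x-1]   : 0;
            float mid   =                    in[y*width + x];
            float right = (x < width - 1)  ? in[y*width + x+1]   : 0;
            float down  = (y < height - 1) ? in[(y+1)*width + x] : 0;
            out[idx] = AnisoTropicDiff(up, left, mid, right, down,
                                       kappa, lambda, normalizeVal);
        }
        __syncthreads();                      // whole iteration finished (single block!)
        float* tmp = in; in = out; out = tmp; // next iteration reads what was just written
    }
}

After the loop, the final result sits in bufB when iters is odd and in bufA when iters is even. With only one block you use just one multiprocessor, though, which is why the launch-per-iteration approach will almost certainly be faster.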

By the way, you still have another error in your code: the kernel reads neighbouring pixels from the same array it writes to, so even with one launch per iteration, a block that has already written its results can corrupt the input of a neighbouring block in the same pass. You should create a separate output array for your kernel and swap the arrays in the for-loop, like so:

__global__ void
AnisoTex( float* g_outdata, float* g_odata, int width, int height, int NumOfFrames, enum KappaCalcFunc mode,
          float lambda, float LinearKappaVal, float minKappa, float maxKappa, int normalizeVal, int iters)
{
    float kappa;
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    kappa = LinearKappaVal;

    // read only from g_odata ...
    float f0p01 = (x>=0 && x<width && y-1>=0 && y-1<height) ? g_odata[(y-1)*width+x] : 0;
    float f0p10 = (x-1>=0 && x-1<width && y>=0 && y<height) ? g_odata[y*width+x-1]   : 0;
    float f0p11 = g_odata[y*width+x];
    float f0p12 = (x+1>=0 && x+1<width && y>=0 && y<height) ? g_odata[y*width+x+1]   : 0;
    float f0p21 = (x>=0 && x<width && y+1>=0 && y+1<height) ? g_odata[(y+1)*width+x] : 0;

    float res = AnisoTropicDiff(f0p01, f0p10, f0p11, f0p12, f0p21, kappa, lambda, normalizeVal);

    __syncthreads();
    // ... and write only to g_outdata
    g_outdata[y*width + x] = res;
} // end of AnisoTex()

void executeKernel() {
    dim3 dimBlock(16, 12, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // execute the kernel
    for (int i = 0; i < iterNum; i++) {
        AnisoTex<<< dimGrid, dimBlock, 0 >>>( d_out_data, d_data_f0, width, height, numOfFrames, kappaMode, m_lambda,
                                              m_LinearKappaVal, m_minKappaVal, m_maxKappaVal, m_normalizeVal,
                                              iterNum);

        // swap input and output so the next iteration reads this iteration's result
        float *tmp = d_data_f0;
        d_data_f0  = d_out_data;
        d_out_data = tmp;

        CUDA_SAFE_CALL( cudaThreadSynchronize() );

        // check if kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");
    }

    printf("Execution of kernel ended\n");
} // end of executeKernel()
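One more thing to watch with this ping-pong scheme: the pointers are swapped after every launch, so once the loop finishes, the latest result is whatever buffer d_data_f0 points to (the underlying allocation alternates between the two). Copy the result back from d_data_f0, not from a pointer you saved before the loop. A minimal sketch, assuming h_result is a host buffer of width*height floats:

// h_result is a hypothetical host buffer of width*height floats
CUDA_SAFE_CALL( cudaMemcpy(h_result, d_data_f0,
                           width * height * sizeof(float),
                           cudaMemcpyDeviceToHost) );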