Loop execution inside the kernel vs. outside of it

Hi,

I am facing a very weird problem:

I wrote an image processing piece of code that requires a few dozen iterations. The original WORKING code is designed so that the kernel executes one iteration at a time, while the loop runs in the host part of the code. To speed it up, I tried to move the loop into the kernel.

The problem is that I get different results, although the code is supposed to do the same thing.

This is the modified code:

__global__ void
AnisoTex( float* g_odata, int width, int height, int NumOfFrames, enum KappaCalcFunc mode,
          float lambda, float LinearKappaVal, float minKappa, float maxKappa, int normalizeVal, int iters)
{
    float kappa;
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    //for(int i=0; i<iters; i++){
    kappa = LinearKappaVal;

    // read the centre pixel and its 4 neighbours (0 outside the image)
    float f0p01 = (x>=0 && x<width && y-1>=0 && y-1<height) ? g_odata[(y-1)*width+x] : 0;
    float f0p10 = (x-1>=0 && x-1<width && y>=0 && y<height) ? g_odata[y*width+x-1]   : 0;
    float f0p11 = g_odata[y*width+x];
    float f0p12 = (x+1>=0 && x+1<width && y>=0 && y<height) ? g_odata[y*width+x+1]   : 0;
    float f0p21 = (x>=0 && x<width && y+1>=0 && y+1<height) ? g_odata[(y+1)*width+x] : 0;

    float res = AnisoTropicDiff(f0p01, f0p10, f0p11, f0p12, f0p21, kappa, lambda, normalizeVal);

    __syncthreads();
    g_odata[y*width + x] = res;
    //    __syncthreads();
    //}
}   // end of AnisoTex()

In this code we read the pixels from an array that resides on the device and manipulate the pixels in that same array.

Note that the inner loop in the above code is commented out.

The way the kernel is called is:

void executeKernel() {
    dim3 dimBlock(16, 12, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // execute the kernel
    for (int i = 0; i < iterNum; i++) {
        AnisoTex<<< dimGrid, dimBlock, 0 >>>( d_data_f0, width, height, numOfFrames, kappaMode, m_lambda,
                                              m_LinearKappaVal, m_minKappaVal, m_maxKappaVal, m_normalizeVal,
                                              iterNum);
        CUDA_SAFE_CALL( cudaThreadSynchronize() );

        // check if kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");
    }

    printf("Execution of kernel ended\n");
}   // end of executeKernel()

Calling the kernel this way produces the expected results. If I uncomment the inner loop in the kernel (and the last __syncthreads()), and comment out the loop in the host function above, I get strange results.

Attached are examples of both results. Although the difference seems small, it accumulates as more iterations are added and hurts the final result.

This is the bad result (iterations inside the kernel):

[attachment=5245:attachment]

This is the good result (iterations outside the kernel):

[attachment=5246:attachment]

If anyone has a clue I’ll be grateful, since I have no idea what the reason for this bizarre problem is.

The __syncthreads() in your kernel code only synchronizes the threads within one block, while calling the kernel in a loop makes sure all blocks have been run before starting the next iteration.

So I guess that pixels in one block depend on values of pixels in another block (I did not check that in your code, but the wrong result does look blocky).

That is the reason why you get different results.
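To make that concrete, here is a toy sketch (a hypothetical kernel, not taken from your code) of why __syncthreads() cannot order accesses that cross block boundaries, while a separate kernel launch per iteration can:

// Hypothetical illustration: block b writes data[b] and then reads data[b-1],
// which is written by a *different* block. __syncthreads() is a barrier for
// this block only, so the read may see either the old or the new value of
// data[b-1]. A new kernel launch per iteration acts as a grid-wide barrier
// and removes this race.
__global__ void crossBlockRace(int* data)
{
    int b = blockIdx.x;
    if (threadIdx.x == 0)
        data[b] = b + 1;           // this block's write
    __syncthreads();               // synchronizes this block only
    if (threadIdx.x == 0 && b > 0)
        data[b] += data[b - 1];    // neighbouring block's value: old or new?
}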

Thanks! You’re right that some pixels are dependent on pixels from other blocks.

I thought that __syncthreads() would do the trick and indeed forgot that it only affects one block.

So is there a way to synchronize all threads and not only threads from the same block?

Thanks again!

No, you can only synchronize the threads within one block. For example, a card with only 6 multiprocessors that each run 3 blocks can have 3*6 = 18 blocks in flight concurrently. So if you need more than 18 blocks there is no way to synchronize them all, since the later blocks cannot run until the first blocks are done.

So I am afraid that you can only:

- loop inside the kernel over the whole image (and thus run only 1 block), iterating numIter times (see the sketch below)

- call the kernel in a loop like you did (and thus pay the overhead of launching the kernel numIter times)

And the second option will most likely always be the fastest.
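For completeness, the first option could look roughly like the sketch below. This is only a sketch: AnisoTexSingleBlock is a hypothetical name, it reuses your AnisoTropicDiff device function, drops the parameters not used in the snippet you posted, and does the ping-pong buffering inside the kernel. It must be launched with a single block, e.g. AnisoTexSingleBlock<<< 1, dimBlock >>>(...), so that __syncthreads() really is a barrier for all running threads.

__global__ void
AnisoTexSingleBlock(float* bufA, float* bufB, int width, int height,
                    float lambda, float kappa, int normalizeVal, int iters)
{
    float* in  = bufA;   // read buffer
    float* out = bufB;   // write buffer

    for (int it = 0; it < iters; it++) {
        // one block covers the whole image: each thread strides over the pixels
        for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
             idx < width * height;
             idx += blockDim.x * blockDim.y) {
            int x = idx % width;
            int y = idx / width;
            float up    = (y > 0)          ? in[(y-1)*width + x] : 0;
            float left  = (x > 0)          ? in[y*width + x-1]   : 0;
            float mid   =                    in[y*width + x];
            float right = (x < width - 1)  ? in[y*width + x+1]   : 0;
            float down  = (y < height - 1) ? in[(y+1)*width + x] : 0;
            out[idx] = AnisoTropicDiff(up, left, mid, right, down,
                                       kappa, lambda, normalizeVal);
        }
        __syncthreads();                      // whole iteration finished (single block!)
        float* tmp = in; in = out; out = tmp; // next iteration reads what was just written
    }
}

After the loop, the final result sits in bufB when iters is odd and in bufA when iters is even. With only one block you use just one multiprocessor, though, which is why the launch-per-iteration approach will almost certainly be faster.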

By the way, you still have another error in your code: the kernel reads neighbouring pixels from the same array it writes to, so even with one launch per iteration, a block that has already written its results can corrupt the input of a neighbouring block in the same pass. You should create a separate output array for your kernel and swap the arrays in the for-loop, like so:

__global__ void
AnisoTex( float* g_outdata, float* g_odata, int width, int height, int NumOfFrames, enum KappaCalcFunc mode,
          float lambda, float LinearKappaVal, float minKappa, float maxKappa, int normalizeVal, int iters)
{
    float kappa;
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    kappa = LinearKappaVal;

    // read only from g_odata ...
    float f0p01 = (x>=0 && x<width && y-1>=0 && y-1<height) ? g_odata[(y-1)*width+x] : 0;
    float f0p10 = (x-1>=0 && x-1<width && y>=0 && y<height) ? g_odata[y*width+x-1]   : 0;
    float f0p11 = g_odata[y*width+x];
    float f0p12 = (x+1>=0 && x+1<width && y>=0 && y<height) ? g_odata[y*width+x+1]   : 0;
    float f0p21 = (x>=0 && x<width && y+1>=0 && y+1<height) ? g_odata[(y+1)*width+x] : 0;

    float res = AnisoTropicDiff(f0p01, f0p10, f0p11, f0p12, f0p21, kappa, lambda, normalizeVal);

    __syncthreads();
    // ... and write only to g_outdata
    g_outdata[y*width + x] = res;
} // end of AnisoTex()

void executeKernel() {
    dim3 dimBlock(16, 12, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // execute the kernel
    for (int i = 0; i < iterNum; i++) {
        AnisoTex<<< dimGrid, dimBlock, 0 >>>( d_out_data, d_data_f0, width, height, numOfFrames, kappaMode, m_lambda,
                                              m_LinearKappaVal, m_minKappaVal, m_maxKappaVal, m_normalizeVal,
                                              iterNum);

        // swap input and output so the next iteration reads this iteration's result
        float *tmp = d_data_f0;
        d_data_f0  = d_out_data;
        d_out_data = tmp;

        CUDA_SAFE_CALL( cudaThreadSynchronize() );

        // check if kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");
    }

    printf("Execution of kernel ended\n");
} // end of executeKernel()
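One more thing to watch with this ping-pong scheme: the pointers are swapped after every launch, so once the loop finishes, the latest result is whatever buffer d_data_f0 points to (the underlying allocation alternates between the two). Copy the result back from d_data_f0, not from a pointer you saved before the loop. A minimal sketch, assuming h_result is a host buffer of width*height floats:

// h_result is a hypothetical host buffer of width*height floats
CUDA_SAFE_CALL( cudaMemcpy(h_result, d_data_f0,
                           width * height * sizeof(float),
                           cudaMemcpyDeviceToHost) );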