Hi,
I am facing a very weird problem.
I wrote an image-processing routine that needs a few dozen iterations. In the original, working version the kernel executes one iteration per launch and the loop lives in the host code. To speed it up, I tried moving the loop into the kernel.
The problem is that I now get different results, although both versions are supposed to do the same thing.
This is the modified code:
__global__ void
AnisoTex( float* g_odata, int width, int height, int NumOfFrames, enum KappaCalcFunc mode,
          float lambda, float LinearKappaVal, float minKappa, float maxKappa, int normalizeVal, int iters)
{
    float kappa;
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    //for(int i=0; i<iters; i++){
        kappa = LinearKappaVal;
        float f0p01=(x>=0 && x<width && y-1>=0 && y-1<height)? g_odata[(y-1)*width+x]:0;
        float f0p10=(x-1>=0 && x-1<width && y>=0 && y<height)? g_odata[y*width+x-1]:0;
        float f0p11=g_odata[y*width+x];
        float f0p12=(x+1>=0 && x+1<width && y>=0 && y<height)? g_odata[y*width+x+1]:0;
        float f0p21=(x>=0 && x<width && y+1>=0 && y+1<height)? g_odata[(y+1)*width+x]:0;
        float res = AnisoTropicDiff(f0p01, f0p10, f0p11, f0p12, f0p21,kappa, lambda, normalizeVal);
        __syncthreads();
        g_odata[y*width + x] = res;
        // __syncthreads();
    //}
} // end of AnisoTex()
The kernel reads pixels from an array in device memory and writes the processed values back into that same array.
Note that the loop inside the kernel above is commented out.
The kernel is called like this:
void executeKernel() {
    dim3 dimBlock(16, 12, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    // execute the kernel
    for (int i=0; i<iterNum;i++){
        AnisoTex<<< dimGrid, dimBlock, 0 >>>( d_data_f0, width, height, numOfFrames, kappaMode, m_lambda,
                                              m_LinearKappaVal,m_minKappaVal, m_maxKappaVal, m_normalizeVal,
                                              iterNum);
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        // check if kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");
    }
    printf("Execution of kernel ended\n");
} // end of executeKernel()
Calling the kernel this way produces the expected results. If I uncomment the loop in the kernel (together with the second __syncthreads()) and remove the loop from the function above so that the kernel is launched only once, I get strange results.
Examples of both results are attached. The difference looks small, but it accumulates as more iterations are added and degrades the final result.
This is the bad result (iterations inside the kernel):
[attachment=5245:attachment]
This is the good result (iterations outside the kernel):
[attachment=5246:attachment]
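One possible explanation, which I have not confirmed: as far as I know, __syncthreads() is a barrier only for the threads of one block, not for the whole grid. With the loop inside the kernel, one block could run several iterations ahead of a neighbouring block and read boundary pixels that are stale or already updated. Below is a minimal host-only C++ model of that idea, not the real kernel. It is 1D, diffuse() is a made-up stand-in for AnisoTropicDiff(), and the blocks are simply run one after another, which is only one of the orders the hardware may pick. All function names in it are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for AnisoTropicDiff(), whose body is not shown in
// this post: plain linear diffusion on a 1D signal (assumption).
static float diffuse(float left, float centre, float right, float lambda) {
    return centre + lambda * (left + right - 2.0f * centre);
}

// Read with zero padding outside the array, like the bounds checks in the kernel.
static float at(const std::vector<float>& v, int i) {
    return (i >= 0 && i < static_cast<int>(v.size())) ? v[i] : 0.0f;
}

// Reference: every iteration reads only from src and writes only to dst,
// then the buffers are swapped. This models one launch per iteration
// with separate input and output buffers.
std::vector<float> iterateDoubleBuffered(std::vector<float> src, int iters, float lambda) {
    std::vector<float> dst(src.size());
    const int n = static_cast<int>(src.size());
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i)
            dst[i] = diffuse(at(src, i - 1), src[i], at(src, i + 1), lambda);
        std::swap(src, dst);
    }
    return src;
}

// Model of the loop-inside-the-kernel version under one possible schedule:
// block 0 finishes all of its iterations before block 1 starts, and so on.
// Inside a block, "read all, then write all" mimics the __syncthreads() barrier.
std::vector<float> iterateInsideKernel(std::vector<float> data, int iters,
                                       float lambda, int blockSize) {
    const int n = static_cast<int>(data.size());
    std::vector<float> res(blockSize);
    for (int start = 0; start < n; start += blockSize) {
        const int end = std::min(start + blockSize, n);
        for (int it = 0; it < iters; ++it) {
            for (int i = start; i < end; ++i)   // all threads of the block read
                res[i - start] = diffuse(at(data, i - 1), data[i], at(data, i + 1), lambda);
            for (int i = start; i < end; ++i)   // barrier, then all threads write
                data[i] = res[i - start];
        }
    }
    return data;
}

// Largest absolute difference between the two versions on a 64-sample step signal.
float boundaryError(int iters, int blockSize) {
    std::vector<float> signal(64, 0.0f);
    for (int i = 32; i < 64; ++i) signal[i] = 100.0f;
    const float lambda = 0.25f;
    const std::vector<float> a = iterateDoubleBuffered(signal, iters, lambda);
    const std::vector<float> b = iterateInsideKernel(signal, iters, lambda, blockSize);
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

In this model, boundaryError(20, 16) is non-zero because the blocks see each other's pixels at different iteration counts, while boundaryError(20, 64) uses a single block and the two versions agree. If this is what happens on the GPU, keeping one launch per iteration and alternating between two device buffers (read from one, write to the other) should avoid it. It would also mean that even my one-iteration kernel is not strictly safe at block boundaries, since it updates g_odata in place.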
If anyone has a clue about what causes this bizarre behaviour, I'd be grateful.