Help: disappointing result comparing CPU vs. GPU versions of my program

Hi all,

I’m implementing an image processing program and I’m having big trouble with the GPU version.

Input data: five arrays: bin1 (2 MB, short), bin2 (2 MB, short), atn1 (2 MB, float), atn2 (2 MB, float), rat (64K float)

Output: f_est (64K float), called d_est in the GPU code.

The following is my CPU program:

[codebox]
for (i = 0; i < IMGSIZ; i++)
{
    for (j = 0; j < IMGSIZ; j++)
    {
        _norm = 0;
        temp  = 0;

        for (s = 0; s < ANGLES; s++)
        {
            p = i*IMGSIZ*ANGLES + j*ANGLES + s;

            t1      = (int)*(bin1 + p);
            t2      = (int)*(bin2 + p);
            atnlen1 = *(atn1 + p);
            atnlen2 = *(atn2 + p);

            _norm = _norm + (atnlen1 + atnlen2);

            if (t1 != 0)
                temp += atnlen1 * rat[s][t1];
            if (t2 != 0)
                temp += atnlen2 * rat[s][t2];
        }

        if (_norm != 0)
            f_est[i][j] = f_est[i][j] * temp / _norm;
    }
}
[/codebox]

The above program takes only 42 ms to complete.

In the GPU program, each thread handles exactly one pixel. The kernel looks as follows:

[codebox]
__global__ void GPU_Pro(float *d_est, float *d_rat,
                        short *d_bin1, short *d_bin2,
                        float *d_atnlen1, float *d_atnlen2)
{
    // One thread per pixel (ix, iy); the rat table is read through tex_rat.
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned int base = ix*IMGSIZ*ANGLES + iy*ANGLES;

    float _norm, temp, atnlen1, atnlen2;
    int   s, t1, t2;
    unsigned int p;
    unsigned int _ad = ix*IMGSIZ + iy;   // flattened pixel index into d_est
    unsigned int id1, id2;

    _norm = 0;
    temp  = 0;

    for (s = 0; s < ANGLES; s++)
    {
        p = base + s;

        t1      = (int)d_bin1[p];
        t2      = (int)d_bin2[p];
        atnlen1 = d_atnlen1[p];
        atnlen2 = d_atnlen2[p];

        id1 = __mul24(s, ANGLES) + t1;
        id2 = __mul24(s, ANGLES) + t2;

        _norm = _norm + __fadd_rz(atnlen1, atnlen2);

        if (t1 != 0)
            temp += __fmul_rz(atnlen1, tex1Dfetch(tex_rat, id1));
        if (t2 != 0)
            temp += __fmul_rz(atnlen2, tex1Dfetch(tex_rat, id2));
    }

    float _val = __fmul_rz(d_est[_ad], temp) / _norm;
    d_est[_ad] = _val;
}
[/codebox]

rat has been mapped to texture memory (tex_rat).

d_bin1, d_bin2, d_atnlen1, d_atnlen2 are in global memory.
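For reference, the texture setup for rat looks roughly like this (a sketch only; rat_bytes and the surrounding error checking are illustrative, but tex_rat and d_rat are the names used in the kernel above):

[codebox]
// Sketch of how d_rat is bound to tex_rat (rat_bytes is illustrative).
texture<float, 1, cudaReadModeElementType> tex_rat;

// d_rat already holds the rat table in device global memory.
cudaBindTexture(0, tex_rat, d_rat, rat_bytes);

// ... kernel launch; the kernel reads the table with tex1Dfetch(tex_rat, index) ...

cudaUnbindTexture(tex_rat);
[/codebox]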

Naturally, I expected it to run faster than the CPU program (in my case: grid size = 256, block size = 64), but it takes 46 ms to complete. Each thread accesses independent data, so I see no opportunity to use shared memory. I’ve thought about a non-coalescing problem, but I don’t know how to solve it in my program.
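For what it’s worth, with the current layout two threads with consecutive ix values read addresses that are IMGSIZ*ANGLES elements apart inside the loop, which is what makes me suspect coalescing. Below is only a sketch of the kind of index change I understand coalescing to need, assuming the host arrays could be transposed so that the pixel index varies fastest; the angle-major layout is hypothetical and I have not tested this:

[codebox]
// Hypothetical angle-major layout: element (pixel, s) stored at s*IMGSIZ*IMGSIZ + pixel.
// Threads whose 'pixel' values are consecutive then read consecutive addresses,
// which is the basic requirement for coalesced loads on this hardware.
const unsigned int pixel = ix*IMGSIZ + iy;      // same flattened index as _ad above

for (s = 0; s < ANGLES; s++)
{
    p = s*IMGSIZ*IMGSIZ + pixel;                // was: ix*IMGSIZ*ANGLES + iy*ANGLES + s

    t1      = (int)d_bin1[p];
    t2      = (int)d_bin2[p];
    atnlen1 = d_atnlen1[p];
    atnlen2 = d_atnlen2[p];
    // ... rest of the loop body unchanged ...
}
[/codebox]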

I’ve also run a test with everything in texture memory (all five arrays mapped to textures, and the kernel reading from texture memory instead of global memory), but that version runs even slower than the global memory one.
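That all-texture variant was set up in essentially the same way as tex_rat above, something like this (a sketch; names other than the d_* pointers are illustrative):

[codebox]
// Sketch of the all-texture experiment.
texture<short, 1, cudaReadModeElementType> tex_bin1, tex_bin2;
texture<float, 1, cudaReadModeElementType> tex_atn1, tex_atn2;

cudaBindTexture(0, tex_bin1, d_bin1, bin_bytes);
cudaBindTexture(0, tex_bin2, d_bin2, bin_bytes);
cudaBindTexture(0, tex_atn1, d_atnlen1, atn_bytes);
cudaBindTexture(0, tex_atn2, d_atnlen2, atn_bytes);

// In the kernel, e.g.:
//   t1      = (int)tex1Dfetch(tex_bin1, p);
//   atnlen1 = tex1Dfetch(tex_atn1, p);
[/codebox]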

(I use CUT_SAFE_CALL(cutStartTimer(timer)) / CUT_SAFE_CALL(cutStopTimer(timer)) to measure the GPU calculation time; the result roughly matches the value in cuda_profile.log.)
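The timing pattern is roughly the standard cutil one (a sketch from memory; I believe the cudaThreadSynchronize() is needed so the timer covers the whole kernel, since launches return asynchronously):

[codebox]
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));

GPU_Pro<<<grid, block>>>(d_est, d_rat, d_bin1, d_bin2,
                         d_atnlen1, d_atnlen2);   // grid = 256 blocks, 64 threads each

// Kernel launches return immediately; wait for completion before stopping the timer.
CUDA_SAFE_CALL(cudaThreadSynchronize());

CUT_SAFE_CALL(cutStopTimer(timer));
printf("GPU time: %f ms\n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));
[/codebox]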

I’ve also found that every 128 global-memory read operations take about 8 ms, which is an astonishing figure compared with what bandwidthTest in the SDK reports. I don’t know why.

Can anyone help me?

I’m looking forward to hearing from you.

Thanks in advance.

System: Intel dual-core 1.8 GHz, 1 GB RAM, PCIe x1; GPU: GeForce 8600 GT, 512 MB.