Hi all,

I’m implementing an image processing program, and I’m running into serious trouble with my GPU version.

Input data: five arrays: bin1 (2MB, short), bin2 (2MB, short), atn1 (2MB, float), atn2 (2MB, float), rat (64K, float)

Output: f_est (64K, float) (called d_est on the GPU)

The following is my CPU program:

```
for (i = 0; i < IMGSIZ; i++)
{
    for (j = 0; j < IMGSIZ; j++)
    {
        _norm = 0;
        temp = 0;
        for (s = 0; s < ANGLES; s++)
        {
            p = i*IMGSIZ*ANGLES + j*ANGLES + s;
            t1 = (int)*(bin1 + p);
            t2 = (int)*(bin2 + p);
            atnlen1 = *(atn1 + p);
            atnlen2 = *(atn2 + p);
            _norm = _norm + (atnlen1 + atnlen2);
            if (t1 != 0)
                temp += atnlen1 * rat[s][t1];
            if (t2 != 0)
                temp += atnlen2 * rat[s][t2];
        }
        if (_norm != 0)
            f_est[i][j] = f_est[i][j] * temp / _norm;
    }
}
```

The above program takes only 42 ms to complete.

In the GPU program, each thread solves for only one pixel. The kernel is as follows:

```
__global__ void GPU_Pro(float* d_est, float* d_rat,
                        short* d_bin1, short* d_bin2,
                        float* d_atnlen1, float* d_atnlen2)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned int base = ix*IMGSIZ*ANGLES + iy*ANGLES;
    float _norm, temp, atnlen1, atnlen2;
    int s, t1, t2;
    unsigned int p;
    unsigned int _ad = ix*IMGSIZ + iy;
    unsigned int id1, id2;
    _norm = 0;
    temp = 0;
    for (s = 0; s < ANGLES; s++)
    {
        p = base + s;
        t1 = (int)d_bin1[p];
        t2 = (int)d_bin2[p];
        atnlen1 = d_atnlen1[p];
        atnlen2 = d_atnlen2[p];
        id1 = __mul24(s, ANGLES) + t1;
        id2 = __mul24(s, ANGLES) + t2;
        _norm = _norm + __fadd_rz(atnlen1, atnlen2);
        if (t1 != 0)
            temp += __fmul_rz(atnlen1, tex1Dfetch(tex_rat, id1));
        if (t2 != 0)
            temp += __fmul_rz(atnlen2, tex1Dfetch(tex_rat, id2));
    }
    if (_norm != 0)  // guard against divide-by-zero, matching the CPU code
        d_est[_ad] = __fmul_rz(d_est[_ad], temp) / _norm;
}
```

rat has been mapped to texture memory (tex_rat).

d_bin1, d_bin2, d_atnlen1, d_atnlen2 reside in global memory.
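For reference, this is roughly how I set up the rat texture (a minimal sketch; the helper name bind_rat and its parameters are just for illustration, while tex_rat and d_rat are the names used in the kernel above):

```
// tex_rat: 1D texture reference over the rat array,
// read with unnormalized integer coordinates via tex1Dfetch().
texture<float, 1, cudaReadModeElementType> tex_rat;

void bind_rat(const float *h_rat, float **d_rat, size_t n)
{
    // copy rat into device global memory
    cudaMalloc((void**)d_rat, n * sizeof(float));
    cudaMemcpy(*d_rat, h_rat, n * sizeof(float), cudaMemcpyHostToDevice);

    // bind the linear device buffer to the texture reference
    cudaBindTexture(NULL, tex_rat, *d_rat, n * sizeof(float));
}
```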

Naturally, it should run faster than the CPU program (in my case: grid size = 256, block size = 64). But when I run it, it takes 46 ms to complete. Each thread accesses independent data (i.e. there is no opportunity to use shared memory). I’ve thought about the non-coalescing problem, but I don’t know how to solve it in my program.
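My understanding of the coalescing issue, for what it’s worth: with p = ix*IMGSIZ*ANGLES + iy*ANGLES + s, two threads with consecutive iy read addresses ANGLES elements apart, so a half-warp issues 16 separate memory transactions instead of one. If the arrays were relaid out angle-major, consecutive threads would read consecutive addresses. A sketch of the one-time host-side relayout (assuming IMGSIZ and ANGLES as above; the function name is just for illustration):

```
// One-time host-side relayout: pixel-major [ix][iy][s] -> angle-major [s][ix][iy].
// Afterwards the kernel would index with
//     p = s*IMGSIZ*IMGSIZ + ix*IMGSIZ + iy;
// so threads with consecutive iy read consecutive global addresses (coalesced).
void to_angle_major(const float *src, float *dst, int imgsiz, int angles)
{
    for (int ix = 0; ix < imgsiz; ix++)
        for (int iy = 0; iy < imgsiz; iy++)
            for (int s = 0; s < angles; s++)
                dst[(s * imgsiz + ix) * imgsiz + iy] =
                    src[(ix * imgsiz + iy) * angles + s];
}
```

The same relayout would be applied to d_bin1, d_bin2, d_atnlen1, and d_atnlen2, since all four are indexed by p.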

I’ve also tested mapping all five arrays to texture memory (reading from textures in the kernel instead of global memory). However, that ran slower than the global-memory version.

(I use CUT_SAFE_CALL(cutStartTimer(timer)) / CUT_SAFE_CALL(cutStopTimer(timer)) to measure the GPU execution time. The result roughly matches the value in cuda_profile.log.)
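As a cross-check on the cutil timer, the kernel alone can be timed with CUDA events (a sketch; gridSize and blockSize stand for the launch configuration mentioned above):

```
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
GPU_Pro<<<gridSize, blockSize>>>(d_est, d_rat, d_bin1, d_bin2,
                                 d_atnlen1, d_atnlen2);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // wait for the kernel to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because kernel launches are asynchronous, a host timer without an explicit synchronization can under- or over-report; events avoid that ambiguity.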

I’ve found that every 128 global-memory reads take about 8 ms. That is astonishingly slow compared with what the bandwidthTest sample in the SDK reports. I don’t know why.

Can anyone help me?

I’m looking forward to hearing from you

Thanks in advance

Computer: Intel Core Duo 1.8 GHz, 1 GB RAM, PCIe x1, GPU: 8600GT 512 MB.