Hi,

I implemented a simple gaussian filter which runs fine under emulation mode but does nothing in real mode.

I try to describe the smallest failing part of the filter. The filter-kernel works row-wise on the image graylevels (dimension nx*ny). For this purpose I copy first the

incoming image h_bm into device memory d_bm. The kernel gets then the image with its dimensions and some constants b[0]…b[4].

After the kernel is finished I copy the device memory to h_res and return it. The behaviour I get when running NOT in emulation mode is, that in the end I have in

h_res *exactly* the same image except of the very first array element which differs in 0.0000610352 from the original image.

I guess that this means:

a: The kernel does *something*, since the difference is not there if I don’t start the kernel at all.

b: I probably screwed up memory assignment or copy which doesn’t come out in emulation mode.

Here is the code-snip for the kernel-call

[codebox]

void cudaGaussianFilter(double *h_bm, long n, int nx, int ny, int nc, double sigma){

double *d_bm,

```
*d_bm_transposed,
*h_res; /* The array where the result is stored */
```

double q; /* An adapted version of the sigma */

size_t stride,stride_tr; /* The stride I have to use instead of ny */

int blocksize, /* How many thread per block */

gridsize; /* How many blocks in all */

blocksize = NUM_OF_THREADS;

gridsize = (ny%blocksize==0)?ny/blocksize:ny/blocksize+1; /* make enough blocks if

ny is not divisible by the choosen |blocksize| which is usually 32*/

@<Calculate the parameter for the gaussian filter@>@;

CUDA_SAFE_CALL(cudaMallocPitch((void **) &d_bm,&stride,nx*sizeof(double),ny) );

CUDA_SAFE_CALL(

```
cudaMemcpy2D((void*)d_bm,stride,(void*)h_bm, nx*sizeof(double),
nx*sizeof(double),ny,cudaMemcpyHostToDevice)
```

);

h_res = (double *) calloc(sizeof(double), nx*ny);

cudaGaussKernel<<<gridsize,blocksize>>>(d_bm,stride,nx,ny,nc,b[0],b[1],b[2],b[3],b[4]);

CUDA_SAFE_CALL(

```
cudaMemcpy2D((void*) h_res, nx*sizeof(double),(void*) d_bm, stride,
nx*sizeof(double), ny, cudaMemcpyDeviceToHost)
```

);

[/codebox]

And here is what my kernel does. I have as many threads as there are rows in my bitmap. Since every thread calculates one line (row) of the image I should have no memory clashes and every thread can access one line of the memory like it wants.

[codebox]

**global** void cudaGaussKernel(

double *d_bm,

size_t stride,

int nx,

int ny,

int nc,

double b0, double b1, double b2, double b3, double B){

```
int pos = blockIdx.x * blockDim.x + threadIdx.x;
if(pos >= ny) return;
```

int n = 0; /* The position in the row */

```
double pV; /* The value which is used for padding */
```

/* Forward iteration. Calculating the boundary-elements by hand. */

```
double* in = (double*) ((char*)d_bm+pos*stride);
pV = in[n];
in[n++] = B*pV+(b1*pV+b2*pV+b3*pV)/b0;
in[n++] = B*in[n]+(b1*in[n-1]+b2*pV+b3*pV)/b0;
in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*pV)/b0;
pV = in[n+nx-1];
while(n<nx){
in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*in[n-3])/b0;
}
n--; /* Going back to the last element in the row */
```

/* Backward iteration. Calculating the boundary-elements by hand. */

```
in[n--] = B*in[n]+(b1*pV+b2*pV+b3*pV)/b0;
in[n--] = B*in[n]+(b1*in[n+1]+b2*pV+b3*pV)/b0;
in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*pV)/b0;
```

while(n>=0){

```
in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*in[n+3])/b0;
}
```

}

[/codebox]

Currently, I have no idea how I can find the bug. Can anyone give me a hint? I already checked the kernel result with CUT_CHECK_ERROR… nothing.

Cheers

Patrick