Working emulation program but failing gpu program How to do a bug search when the emulation runs fin

halirutan · December 7, 2008, 4:44am

Hi,

I implemented a simple gaussian filter which runs fine under emulation mode but does nothing in real mode.

I try to describe the smallest failing part of the filter. The filter-kernel works row-wise on the image graylevels (dimension nx*ny). For this purpose I copy first the

incoming image h_bm into device memory d_bm. The kernel gets then the image with its dimensions and some constants b[0]…b[4].

After the kernel is finished I copy the device memory to h_res and return it. The behaviour I get when running NOT in emulation mode is, that in the end I have in

h_res exactly the same image except of the very first array element which differs in 0.0000610352 from the original image.

I guess that this means:

a: The kernel does something, since the difference is not there if I don’t start the kernel at all.

b: I probably screwed up memory assignment or copy which doesn’t come out in emulation mode.

Here is the code-snip for the kernel-call

[codebox]

void cudaGaussianFilter(double *h_bm, long n, int nx, int ny, int nc, double sigma){

double *d_bm,

     *d_bm_transposed,

     *h_res; /* The array where the result is stored */

double q; /* An adapted version of the sigma */

size_t stride,stride_tr; /* The stride I have to use instead of ny */

int blocksize, /* How many thread per block */

gridsize; /* How many blocks in all */

blocksize = NUM_OF_THREADS;

gridsize = (ny%blocksize==0)?ny/blocksize:ny/blocksize+1; /* make enough blocks if

ny is not divisible by the choosen |blocksize| which is usually 32*/

@<Calculate the parameter for the gaussian filter@>@;

CUDA_SAFE_CALL(cudaMallocPitch((void **) &d_bm,&stride,nx*sizeof(double),ny) );

CUDA_SAFE_CALL(

cudaMemcpy2D((void*)d_bm,stride,(void*)h_bm, nx*sizeof(double),

             nx*sizeof(double),ny,cudaMemcpyHostToDevice)

);

h_res = (double ) calloc(sizeof(double), nxny);

cudaGaussKernel<<<gridsize,blocksize>>>(d_bm,stride,nx,ny,nc,b[0],b[1],b[2],b[3],b[4]);

CUDA_SAFE_CALL(

cudaMemcpy2D((void*) h_res, nx*sizeof(double),(void*) d_bm, stride,

nx*sizeof(double), ny, cudaMemcpyDeviceToHost)

);

[/codebox]

And here is what my kernel does. I have as many threads as there are rows in my bitmap. Since every thread calculates one line (row) of the image I should have no memory clashes and every thread can access one line of the memory like it wants.

[codebox]

global void cudaGaussKernel(

double *d_bm,

size_t stride,

int nx,

int ny,

int nc,

double b0, double b1, double b2, double b3, double B){

int pos = blockIdx.x * blockDim.x + threadIdx.x;

if(pos >= ny) return;

int n = 0; /* The position in the row */

double pV; /* The value which is used for padding */

/* Forward iteration. Calculating the boundary-elements by hand. */

  double* in = (double*) ((char*)d_bm+pos*stride);

  pV = in[n];

  in[n++] = B*pV+(b1*pV+b2*pV+b3*pV)/b0;

  in[n++] = B*in[n]+(b1*in[n-1]+b2*pV+b3*pV)/b0;

  in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*pV)/b0;

  pV = in[n+nx-1];

  while(n<nx){

   in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*in[n-3])/b0;

  }

  n--; /* Going back to the last element in the row */

/* Backward iteration. Calculating the boundary-elements by hand. */

 in[n--] = B*in[n]+(b1*pV+b2*pV+b3*pV)/b0;

 in[n--] = B*in[n]+(b1*in[n+1]+b2*pV+b3*pV)/b0;

 in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*pV)/b0;

while(n>=0){

   in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*in[n+3])/b0;

 }

}

[/codebox]

Currently, I have no idea how I can find the bug. Can anyone give me a hint? I already checked the kernel result with CUT_CHECK_ERROR… nothing.

Cheers

Patrick

alex_dubinsky · December 7, 2008, 9:21pm

The following is the debug strategy of last resort. It works on all compilers, all architectures, requires no printf() or any other library. It is magical. It is beautiful. It is something you need to master to work in CUDA.

Comment everything out until something works. Then uncomment until it doesn’t.

Isn’t it wonderful? Keep in mind the compiler deletes code that doesn’t end up writing to device memory (ie, useless code).

P.S. CUT_CHECK_ERROR only works in debug builds.

halirutan · December 8, 2008, 5:12am

I was afraid of this answer, but I’ll give it a try. Today I had at least 2 ideas what I should check before continuously commenting out. This is the plan for tomorrow.

Thanks,

Patrick

alex_dubinsky · December 8, 2008, 5:49am

Btw, are you using -arch=sm_13?

Because if you’re not, you shouldn’t be using doubles!

cbuchner1 · December 8, 2008, 11:01pm

Emulation mode will not catch when you accidentially pass a pointer to host memory to a kernel or if you accidentially access device memory from the CPU. So this is one thing to look for.

The CUDA_SAFE_CALL macros in Debug mode should throw an error when this occurs. When you directly call a kernel with such incorrect arguments you will see an Unspecified Launch Failure.

Christian

Topic		Replies	Views
Kernel Works in Emulation Mode But Not In Release Mode CUDA Programming and Performance	5	3343	January 7, 2010
Emulator works but G80 doesn't CUDA Programming and Performance	11	5432	July 3, 2007
Error running code that works in emulation mode CUDA Programming and Performance	5	3921	July 19, 2008
How or do you use emulation mode CUDA Programming and Performance	12	6801	September 11, 2008
strange behavior with device emulation CUDA Programming and Performance	5	2693	May 20, 2008
hardware breakpoint or something :) CUDA Programming and Performance	2	1380	September 14, 2008
Weird error CUDA Programming and Performance	5	2613	August 27, 2007
CUDA hangs on GPU but not in emulation CUDA Programming and Performance	7	5352	August 21, 2008
Help with strange error CUDA Programming and Performance	8	2095	February 25, 2010
Correct output with emulation mode, wrong with GPU/Execution CUDA Programming and Performance	6	3324	March 25, 2010

Working emulation program but failing gpu program How to do a bug search when the emulation runs fin

Related topics