Working emulation program but failing gpu program How to do a bug search when the emulation runs fin


I implemented a simple gaussian filter which runs fine under emulation mode but does nothing in real mode.

I try to describe the smallest failing part of the filter. The filter-kernel works row-wise on the image graylevels (dimension nx*ny). For this purpose I copy first the

incoming image h_bm into device memory d_bm. The kernel gets then the image with its dimensions and some constants b[0]…b[4].

After the kernel is finished I copy the device memory to h_res and return it. The behaviour I get when running NOT in emulation mode is, that in the end I have in

h_res exactly the same image except of the very first array element which differs in 0.0000610352 from the original image.

I guess that this means:

a: The kernel does something, since the difference is not there if I don’t start the kernel at all.

b: I probably screwed up memory assignment or copy which doesn’t come out in emulation mode.

Here is the code-snip for the kernel-call


void cudaGaussianFilter(double *h_bm, long n, int nx, int ny, int nc, double sigma){

double *d_bm,


     *h_res; /* The array where the result is stored */

double q; /* An adapted version of the sigma */

size_t stride,stride_tr; /* The stride I have to use instead of ny */

int blocksize, /* How many thread per block */

gridsize; /* How many blocks in all */

blocksize = NUM_OF_THREADS;

gridsize = (ny%blocksize==0)?ny/blocksize:ny/blocksize+1; /* make enough blocks if

ny is not divisible by the choosen |blocksize| which is usually 32*/

@<Calculate the parameter for the gaussian filter@>@;

CUDA_SAFE_CALL(cudaMallocPitch((void **) &d_bm,&stride,nx*sizeof(double),ny) );


cudaMemcpy2D((void*)d_bm,stride,(void*)h_bm, nx*sizeof(double),



h_res = (double ) calloc(sizeof(double), nxny);



cudaMemcpy2D((void*) h_res, nx*sizeof(double),(void*) d_bm, stride,

nx*sizeof(double), ny, cudaMemcpyDeviceToHost)



And here is what my kernel does. I have as many threads as there are rows in my bitmap. Since every thread calculates one line (row) of the image I should have no memory clashes and every thread can access one line of the memory like it wants.


global void cudaGaussKernel(

double *d_bm,

size_t stride,

int nx,

int ny,

int nc,

double b0, double b1, double b2, double b3, double B){

int pos = blockIdx.x * blockDim.x + threadIdx.x;

if(pos >= ny) return;

int n = 0; /* The position in the row */

double pV; /* The value which is used for padding */

/* Forward iteration. Calculating the boundary-elements by hand. */

  double* in = (double*) ((char*)d_bm+pos*stride);

  pV = in[n];

  in[n++] = B*pV+(b1*pV+b2*pV+b3*pV)/b0;

  in[n++] = B*in[n]+(b1*in[n-1]+b2*pV+b3*pV)/b0;

  in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*pV)/b0;

  pV = in[n+nx-1];


   in[n++] = B*in[n]+(b1*in[n-1]+b2*in[n-2]+b3*in[n-3])/b0;


  n--; /* Going back to the last element in the row */

/* Backward iteration. Calculating the boundary-elements by hand. */

 in[n--] = B*in[n]+(b1*pV+b2*pV+b3*pV)/b0;

 in[n--] = B*in[n]+(b1*in[n+1]+b2*pV+b3*pV)/b0;

 in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*pV)/b0;


   in[n--] = B*in[n]+(b1*in[n+1]+b2*in[n+2]+b3*in[n+3])/b0;




Currently, I have no idea how I can find the bug. Can anyone give me a hint? I already checked the kernel result with CUT_CHECK_ERROR… nothing.



The following is the debug strategy of last resort. It works on all compilers, all architectures, requires no printf() or any other library. It is magical. It is beautiful. It is something you need to master to work in CUDA.

Comment everything out until something works. Then uncomment until it doesn’t.

Isn’t it wonderful? Keep in mind the compiler deletes code that doesn’t end up writing to device memory (ie, useless code).

P.S. CUT_CHECK_ERROR only works in debug builds.

I was afraid of this answer, but I’ll give it a try. Today I had at least 2 ideas what I should check before continuously commenting out. This is the plan for tomorrow.



Btw, are you using -arch=sm_13?

Because if you’re not, you shouldn’t be using doubles!

Emulation mode will not catch when you accidentially pass a pointer to host memory to a kernel or if you accidentially access device memory from the CPU. So this is one thing to look for.

The CUDA_SAFE_CALL macros in Debug mode should throw an error when this occurs. When you directly call a kernel with such incorrect arguments you will see an Unspecified Launch Failure.