Why do Debug and Release builds give different results?

I installed the latest CUDA software yesterday and wrote an image processing program. The Debug build gives the right result, but the Release build does not. I use Visual Studio 2010 and CUDA 4.0. My GPU is a GT240M.

This is the kernel:


__global__ void kernelGrayPrewitt(uchar *src, uchar *dst, unsigned int width, unsigned int heigth, const uchar thresh)
{
  int col = threadIdx.x + blockIdx.x*blockDim.x;
  int row = threadIdx.y + blockIdx.y*blockDim.y;

  if (col < 1 || row < 1 || row > heigth - 1 || col > width - 2)
  {
    return;
  }

  int offset = row*width + col;
  int offsetUp = (row-1) * width + col;
  int offsetDown = (row+1) * width + col;

  uchar data[3][3] =
  { {src[offsetUp-1],   src[offsetUp],   src[offsetUp+1]},
    {src[offset-1],     src[offset],     src[offset+1]},
    {src[offsetDown-1], src[offsetDown], src[offsetDown+1]}};//prepare the data.

  unsigned int po1 = data[0][2]+data[1][2]+data[2][2];
  unsigned int po2 = data[0][0]+data[1][0]+data[2][0];
  uchar dx = po1 > po2? po1-po2:po2-po1;

  po1 = data[0][0]+data[0][1]+data[0][2];
  po2 = data[2][0]+data[2][1]+data[2][2];
  uchar dy = po1 > po2? po1-po2:po2-po1;

  dst[offset] = dy > dx? dy:dx;

  if (dst[offset] > thresh)
  {
     dst[offset] = 255u;
  }
  else
  {
     dst[offset] = 0;
  }
}

These are my results:

1. Release result

2. Debug result

3. Original image

Thanks for your attention. I look forward to your reply. Thank you very much.

Nondeterministic results immediately make me think of bad synchronization. That being said, I can't spot anything in your code, as every thread just seems to output one pixel using the src input. How are you calling the kernel?

And if the second image is the desired result, why this portion of the code?

  if (dst[offset] > thresh)
  {
     dst[offset] = 255u;
  }
  else
  {
     dst[offset] = 0;
  }


It seems to me that this just makes the image black and white, as in the first image.

“dst” contains the pixels you are showing as image, right?

Are your project configurations the same for both your debug and release builds?

Are “src” and “dst” separate, non-overlapping memory regions? Does the app that calls this kernel check the return code of every CUDA API call and every kernel launch?

Thanks for your reply. I had not inserted the right image; I have now replaced it. thresh = 80u. I use OpenCV to read the image. My code is too complex to show in full. Every thread processes 9 pixels and outputs 1 pixel.

I have checked the configurations. They are basically the same: lib folder, include folder.

Yes, they are separate. The Debug result is right. Is it a bug in the toolkit?

Nice pic. Lena…Pretty good!

Figuring the contours of Lena through GPU!!! Ha ha… :-)

A compiler issue cannot be excluded. If you have carefully checked all other potential sources of the problem, and if you cannot attach a self-contained repro case at the moment, you could try gradually reducing the optimization level. There are two relevant optimization flags, one for the Open64 frontend, the other for PTXAS. Note that these component-level flags should not be used in production builds, but they are useful for experiments like the one we are doing here. The flags are:

-Xopencc -O{0|1|2|3}
-Xptxas -O{0|1|2|3}

The default for release builds is -O3 for both. Starting with the standard release build, and changing nothing else, reduce optimization level one at a time, starting with PTXAS. As you work your way towards lower optimization levels, you should presumably start getting the correct output at some point. I am fairly (but not entirely) sure that debug builds correspond to -Xopencc -O0 and -Xptxas -O0.


You might want to try adding -Xopencc -Wall to see if the compiler can spot anything out of the ordinary.

Thanks for your reply. I will give it a try. Thanks a lot.

Sir, today I came back from another city, and the issue has still not been solved.

If I change the code

dst[offset] = dy > dx? dy:dx;

to

dst[offset] = dy;

or

dst[offset] = dx;

the result is right.

I am so tired of it…

You have 128 good lines in the image. 128 is 256/2; find out why.

128 lines × 512 pixels = 65536

Hello, what do you mean? I don't understand. :shifty:

65536 is the limit in a grid.

65536 can also be the limit of a small integer type.

So what types do the variables have in dst[offset] = dy > dx? dy:dx;
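To make the hint about types concrete, here is a minimal host-side C++ sketch (not the original poster's code) of what happens when the difference of two three-pixel sums is stored in an unsigned char, as the kernel does for dx and dy. Each sum can be as large as 3 * 255 = 765, so the difference does not always fit in 8 bits, and the conversion keeps only the value modulo 256. The function names narrowDiff and wideDiff are made up for this illustration.

```cpp
typedef unsigned char uchar;

// Mirrors the kernel's pattern: the difference is stored in a uchar,
// so only its low 8 bits survive (the value modulo 256).
unsigned int narrowDiff(unsigned int po1, unsigned int po2)
{
    uchar d = po1 > po2 ? po1 - po2 : po2 - po1;
    return d;
}

// Same arithmetic kept in unsigned int: nothing is cut off.
unsigned int wideDiff(unsigned int po1, unsigned int po2)
{
    return po1 > po2 ? po1 - po2 : po2 - po1;
}
```

For example, three pixels of value 100 next to three black pixels give sums 300 and 0: wideDiff(300, 0) is 300, while narrowDiff(300, 0) is 44, which would fall below thresh = 80 even though the edge is strong.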

Hello guys, I have solved the problem. It was a type error. nvcc handles it differently in the different configurations.
Thank you very much.
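The thread does not say exactly which type was changed, so the following is only a guess at the kind of fix, written as a host-side C++ sketch of the per-pixel arithmetic rather than as the poster's kernel. It assumes the gradients are kept in unsigned int until the threshold comparison, so that only the final 0 or 255 is stored as a uchar. The names absDiff and prewittPixel are hypothetical, and the caller is assumed to pass only interior pixels.

```cpp
typedef unsigned char uchar;

// Absolute difference of two unsigned values without going negative.
static unsigned int absDiff(unsigned int a, unsigned int b)
{
    return a > b ? a - b : b - a;
}

// Thresholded Prewitt response for the interior pixel (row, col) of a
// row-major 8-bit image. The caller must ensure 1 <= row <= height - 2
// and 1 <= col <= width - 2 so that all nine reads stay inside the image.
uchar prewittPixel(const uchar *src, unsigned int width,
                   unsigned int row, unsigned int col, uchar thresh)
{
    const uchar *up   = src + (row - 1) * width + col;
    const uchar *mid  = src + row * width + col;
    const uchar *down = src + (row + 1) * width + col;

    // Column sums and row sums, each at most 3 * 255 = 765.
    unsigned int right  = up[1]    + mid[1]  + down[1];
    unsigned int left   = up[-1]   + mid[-1] + down[-1];
    unsigned int top    = up[-1]   + up[0]   + up[1];
    unsigned int bottom = down[-1] + down[0] + down[1];

    // Keep the gradients wide; only the final 0/255 value becomes a uchar.
    unsigned int dx  = absDiff(right, left);
    unsigned int dy  = absDiff(top, bottom);
    unsigned int mag = dy > dx ? dy : dx;

    return mag > thresh ? 255u : 0u;
}
```

With a 3x3 test image whose right column is 100 and whose other pixels are 0, the centre pixel gets dx = 300 and is reported as an edge (255) for thresh = 80. The same arithmetic should carry over to the kernel body by declaring dx and dy as unsigned int there.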