I installed the latest CUDA software yesterday and wrote an image-processing program. The Debug build produces the correct result, but the Release build does not. I use Visual Studio 2010 and CUDA 4.0. My GPU is a GT 240M.
This is the kernel:
//Prewitt
__global__ void kernelGrayPrewitt(uchar *src, uchar *dst, unsigned int width, unsigned int heigth, const uchar thresh)
{
    int col = threadIdx.x + blockIdx.x*blockDim.x;
    int row = threadIdx.y + blockIdx.y*blockDim.y;
    if (col < 1 || row < 1 || row > heigth - 1 || col > width - 2)
    {
        return;
    }
    int offset = row*width + col;
    int offsetUp = (row-1) * width + col;
    int offsetDown = (row+1) * width + col;
    uchar data[3][3] =
    { {src[offsetUp-1], src[offsetUp], src[offsetUp+1]},
      {src[offset-1], src[offset], src[offset+1]},
      {src[offsetDown-1], src[offsetDown], src[offsetDown+1]} }; // prepare the data.
    unsigned int po1 = data[0][2]+data[1][2]+data[2][2];
    unsigned int po2 = data[0][0]+data[1][0]+data[2][0];
    uchar dx = po1 > po2 ? po1-po2 : po2-po1;
    po1 = data[0][0]+data[0][1]+data[0][2];
    po2 = data[2][0]+data[2][1]+data[2][2];
    uchar dy = po1 > po2 ? po1-po2 : po2-po1;
    dst[offset] = dy > dx ? dy : dx;
    if (dst[offset] > thresh)
    {
        dst[offset] = 255u;
    }
    else
    {
        dst[offset] = 0;
    }
}
Are “src” and “dst” separate, non-overlapping memory regions? Does the app that calls this kernel check the return code of every CUDA API call and every kernel launch?
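For illustration, here is a minimal sketch of what checking every API call and every kernel launch can look like. The macro `CUDA_CHECK`, the placeholder kernel `invertKernel`, and the function `runChecked` are hypothetical names made up for this example, not part of the posted program. It assumes `n > 0` and, for brevity, does not free device memory on the failure paths.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

typedef unsigned char uchar;

// Hypothetical helper macro: print the failing call and make the
// enclosing function return false. Only usable in functions returning bool.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed at %s:%d: %s\n", #call,        \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            return false;                                             \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real image kernel.
__global__ void invertKernel(const uchar *src, uchar *dst, unsigned int n)
{
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) dst[i] = 255 - src[i];
}

// Runs the placeholder kernel with every call checked; returns true
// only if all calls succeed and the output matches the host computation.
bool runChecked(unsigned int n)
{
    std::vector<uchar> hSrc(n), hDst(n, 0);
    for (unsigned int i = 0; i < n; ++i) hSrc[i] = (uchar)(i & 0xFF);

    // Two separate allocations, so src and dst cannot overlap.
    uchar *dSrc = NULL, *dDst = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dSrc, n));
    CUDA_CHECK(cudaMalloc((void **)&dDst, n));
    CUDA_CHECK(cudaMemcpy(dSrc, &hSrc[0], n, cudaMemcpyHostToDevice));

    invertKernel<<<(n + 255) / 256, 256>>>(dSrc, dDst, n);
    CUDA_CHECK(cudaGetLastError());        // invalid launch configuration etc.
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&hDst[0], dDst, n, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dSrc));
    CUDA_CHECK(cudaFree(dDst));

    for (unsigned int i = 0; i < n; ++i)
        if (hDst[i] != (uchar)(255 - hSrc[i])) return false;
    return true;
}
```

Calling `runChecked(640 * 480)` from `main` exercises the pattern. The two checks after the launch matter because a kernel launch itself returns nothing: `cudaGetLastError()` reports launch failures, and the synchronize call reports faults that occur during execution.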
Thanks for your reply. I had not inserted the right image; I have now replaced it. thresh = 80u. I use OpenCV to read the image. My code is too complex to post in full. Each thread reads 9 pixels and outputs 1 pixel.
A compiler issue cannot be excluded. If you have carefully checked all other potential sources of problems, and if you cannot attach a self-contained repro case at the moment, you could try gradually reducing the optimization level. There are two relevant optimization flags, one for the Open64 frontend, the other for PTXAS. Note that these component-level flags should not be used in production builds, but they are useful for experiments like the one we are doing here. The flags are:
-Xopencc -O{0|1|2|3}
-Xptxas -O{0|1|2|3}
The default for release builds is -O3 for both. Starting with the standard release build, and changing nothing else, reduce optimization level one at a time, starting with PTXAS. As you work your way towards lower optimization levels, you should presumably start getting the correct output at some point. I am fairly (but not entirely) sure that debug builds correspond to -Xopencc -O0 and -Xptxas -O0.
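To tell at which optimization level the output becomes correct, it may help to compare the GPU result pixel by pixel against a host-side reference instead of judging the image by eye. Below is a minimal sketch of that idea, assuming an 8-bit grayscale, row-major image without row padding. All names (`prewittPixel`, `prewittSketch`, `prewittHost`, `compareGpuToHost`) and the synthetic test image are hypothetical. The per-pixel arithmetic mirrors the posted kernel, including the truncation of the sum differences (which can reach 765) to 8 bits. One deliberate difference: the sketch skips a one-pixel border on all four sides, whereas the posted check `row > heigth - 1` lets `row == heigth - 1` through, so `offsetDown` indexes one row past the image. Whether that is related to the Debug/Release difference is not established here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

typedef unsigned char uchar;

// One output pixel; caller guarantees 1 <= row <= height-2, 1 <= col <= width-2.
__host__ __device__ inline uchar prewittPixel(const uchar *src, unsigned int width,
                                              unsigned int row, unsigned int col,
                                              uchar thresh)
{
    const uchar *up   = src + (row - 1) * width + col;
    const uchar *mid  = src + row * width + col;
    const uchar *down = src + (row + 1) * width + col;
    unsigned int right  = up[1]  + mid[1]  + down[1];
    unsigned int left   = up[-1] + mid[-1] + down[-1];
    unsigned int top    = up[-1]   + up[0]   + up[1];
    unsigned int bottom = down[-1] + down[0] + down[1];
    // Mirrors the posted code: the absolute differences are truncated to 8 bits.
    uchar dx = (uchar)(right > left ? right - left : left - right);
    uchar dy = (uchar)(top > bottom ? top - bottom : bottom - top);
    uchar g = dy > dx ? dy : dx;
    return g > thresh ? (uchar)255 : (uchar)0;
}

__global__ void prewittSketch(const uchar *src, uchar *dst, unsigned int width,
                              unsigned int height, uchar thresh)
{
    unsigned int col = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int row = threadIdx.y + blockIdx.y * blockDim.y;
    // Skip the border so row+1 and col+1 stay inside the image.
    if (col < 1 || row < 1 || row + 1 >= height || col + 1 >= width) return;
    dst[row * width + col] = prewittPixel(src, width, row, col, thresh);
}

// Host reference using the same per-pixel function.
void prewittHost(const uchar *src, uchar *dst, unsigned int width,
                 unsigned int height, uchar thresh)
{
    for (unsigned int row = 1; row + 1 < height; ++row)
        for (unsigned int col = 1; col + 1 < width; ++col)
            dst[row * width + col] = prewittPixel(src, width, row, col, thresh);
}

// Returns the number of differing pixels, or -1 on any CUDA error.
int compareGpuToHost(unsigned int width, unsigned int height, uchar thresh)
{
    size_t n = (size_t)width * height;
    std::vector<uchar> src(n), ref(n, 0), out(n, 0);
    for (size_t i = 0; i < n; ++i)               // synthetic test image
        src[i] = (uchar)((i * 37u + (i / width) * 11u) & 0xFF);
    prewittHost(&src[0], &ref[0], width, height, thresh);

    uchar *dSrc = NULL, *dDst = NULL;
    if (cudaMalloc((void **)&dSrc, n) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dDst, n) != cudaSuccess) { cudaFree(dSrc); return -1; }
    cudaError_t err = cudaMemcpy(dSrc, &src[0], n, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemset(dDst, 0, n);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        prewittSketch<<<grid, block>>>(dSrc, dDst, width, height, thresh);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaMemcpy(&out[0], dDst, n, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (out[i] != ref[i]) {
            if (mismatches == 0)
                printf("first mismatch at row %u, col %u\n",
                       (unsigned)(i / width), (unsigned)(i % width));
            ++mismatches;
        }
    }
    return mismatches;
}
```

Building this once per optimization setting and calling, for example, `compareGpuToHost(64, 48, 80)` gives a mismatch count for each build instead of a visual impression. Replacing the synthetic image with the real input data would be the natural next step.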
[later:]
You might want to try adding -Xopencc -Wall to see if the compiler can spot anything out of the ordinary.