Hi all,
I’m writing a simple edge detector in CUDA, and it generally works fine. But I’m facing a problem with the number of blocks and threads per block in the kernel call: if I use a large number of blocks on the call for the first image I get garbage results, while if I use a small number of blocks, like 8, everything works fine, even if I increase the number of blocks for further calls. I don’t know where this behaviour comes from. Maybe someone has an idea. Every hint is appreciated :) .
Some Facts:
CUDA 1.1 on Windows XP
GPU: 9500M GS
The CUDA Code:
__global__
void operateRobert(unsigned char* data, unsigned char* res, int width, int height){
    // get indices (position in memory)
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // check index is inside the valid region (skip the border)
    if(x>0 && y>0 && x<width-1 && y<height-1){
        // calculate the value and clamp it to 255
        int value = abs((data[(y*width)+x] - data[((y+1)*width)+(x+1)])
                      + (data[((y+1)*width)+x] - data[(y*width)+(x+1)]));
        if(value>255){
            value=255;
        }
        // store result
        res[(y*width)+x]=(unsigned char)value;
    }
}//operateRobert
// calling the kernel
dim3 block(128, 128, 1);
dim3 grid(width / block.x, height / block.y, 1);
// do calculation
operateRobert<<<grid,block,0>>>(d_d,result,width,height);
Regards, LeRoi