Strange behavior in image processing

Hi all,

I’m writing a simple edge detection filter with CUDA, and it generally works fine. However, I’m facing a problem with the number of blocks and threads per block in the kernel call. If I use a large number of blocks in the call for the first image, I get crappy results. If I use a small number of blocks, like 8, everything works fine, even if I increase the number of blocks for later calls. I don’t know where this behaviour comes from. Maybe someone has an idea. Every hint is appreciated :)

Some Facts:

CUDA 1.1 on Windows XP

GPU: 9500M GS

The CUDA Code:


__global__ void operateRobert(unsigned char* data, unsigned char* res, int width, int height){
	// get indices (position in memory)
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// check that the pixel is in the valid region (skip the image border)
	if(x>0 && y>0 && x<width-1 && y<height-1){
		// var to store calculated value
		int value;

		// calculate value
		value = abs((data[(y*width)+x] - data[((y+1)*width)+(x+1)]) + (data[((y+1)*width)+x] - data[(y*width)+(x+1)]));

		// store result (assumed: written to the same pixel position in res)
		res[(y*width)+x] = value;
	}
}

// calling the kernel
dim3 block(128, 128, 1);
dim3 grid(width / block.x, height / block.y, 1);
// do calculation

Regards, LeRoi

This probably won’t solve your “crappy results” problem, but in case your image size is not a multiple of the block dimensions, rounding up the grid dimensions like this may help. The bounds check against width and height inside your kernel makes this safe.

// calling the kernel
dim3 block(128, 128, 1);
dim3 grid((width + block.x-1) / block.x, (height + block.y-1) / block.y, 1);
// do calculation
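
To make the rounding concrete, here is a small host-side sketch. The helper names divUp and uncoveredPixels are hypothetical (they are not part of the CUDA API), and the numbers are only an illustration: for a width of 1000 and a block edge of 16, truncating division gives 62 blocks, which cover 992 columns and leave 8 columns without a thread. Rounding up gives 63 blocks covering 1008 columns, and the bounds check in the kernel discards the 8 surplus threads.

```cuda
// Host-side helpers; plain C++ that should also compile under nvcc.
// Both names are hypothetical and chosen only for this illustration.

// Integer division, rounded up: the number of blocks needed to cover n pixels.
static unsigned int divUp(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

// Pixels along one axis that get no thread when the grid size
// is computed with truncating division (size / blockEdge).
static unsigned int uncoveredPixels(unsigned int size, unsigned int blockEdge)
{
    return size - (size / blockEdge) * blockEdge;
}
```

When the size is already a multiple of the block edge, both ways of computing the grid give the same result.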


Generally, you should consider placing your input data in a texture. Alternatively, consider moving blocks of input data into shared memory before running the edge detection filter on them. Right now you have 4 memory reads per output pixel, and they all go through slow global memory in an uncoalesced manner. In other words, you’re not getting peak bandwidth, and you’re also consuming 4 times the bandwidth your reads actually require.
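
As a rough sketch of the shared memory variant suggested above (not a tuned implementation): each block first copies its patch of the image, plus one extra column and row, into shared memory, and then applies the same arithmetic as the kernel in the question. The names robertsShared, runRobertsShared and TILE are assumptions made for this example, as is the 16 x 16 block size. The clamp to 255 before the store is also an addition of this sketch. Whether this is actually faster depends on the hardware and should be measured.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile edge: 16 x 16 = 256 threads per block

// Sketch: stage a (TILE+1) x (TILE+1) patch in shared memory, then filter.
// Must be launched with blockDim = (TILE, TILE, 1).
__global__ void robertsShared(const unsigned char* data, unsigned char* res,
                              int width, int height)
{
    __shared__ unsigned char tile[TILE + 1][TILE + 1];

    int tx = threadIdx.x, ty = threadIdx.y;
    int x = blockIdx.x * TILE + tx;
    int y = blockIdx.y * TILE + ty;

    // Clamp coordinates so threads outside the image still read a valid pixel
    int cx  = min(x, width - 1),      cy  = min(y, height - 1);
    int cx1 = min(x + 1, width - 1),  cy1 = min(y + 1, height - 1);

    // Every thread loads its own pixel; edge threads also load the halo
    tile[ty][tx] = data[cy * width + cx];
    if (tx == TILE - 1) tile[ty][TILE] = data[cy * width + cx1];
    if (ty == TILE - 1) tile[TILE][tx] = data[cy1 * width + cx];
    if (tx == TILE - 1 && ty == TILE - 1) tile[TILE][TILE] = data[cy1 * width + cx1];

    __syncthreads();  // all loads must finish before any thread reads the tile

    // Same valid region and arithmetic as the kernel in the question
    if (x > 0 && y > 0 && x < width - 1 && y < height - 1) {
        int value = abs((tile[ty][tx] - tile[ty + 1][tx + 1]) +
                        (tile[ty + 1][tx] - tile[ty][tx + 1]));
        res[y * width + x] = (unsigned char)min(value, 255);
    }
}

// Hypothetical host wrapper: copies the image to the device, runs the
// kernel, copies the result back. Returns false if any step fails.
bool runRobertsShared(const unsigned char* hostIn, unsigned char* hostOut,
                      int width, int height)
{
    size_t bytes = (size_t)width * height;
    unsigned char *devIn = 0, *devOut = 0;
    if (cudaMalloc((void**)&devIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&devOut, bytes) != cudaSuccess) { cudaFree(devIn); return false; }
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);
    cudaMemset(devOut, 0, bytes);  // border pixels are never written by the kernel

    dim3 block(TILE, TILE, 1);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE, 1);
    robertsShared<<<grid, block>>>(devIn, devOut, width, height);

    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(devIn);
    cudaFree(devOut);
    return ok;
}
```

With this layout, each input pixel is read from global memory about once per block instead of up to four times.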


Thanks for the replies. Quoc Vinh, you’re right: the threads-per-block limit was my mistake, and the program works fine now. My problem now is speed. Before I stuck to the 512 limit, I got results much faster (but they were not working properly). Why? Besides the texture and shared memory suggestions Christian pointed out, are there other ways to speed up the calculation?
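
One likely explanation for the earlier "fast" runs: a launch whose block exceeds the per-block thread limit is rejected, the kernel never executes, and the call returns almost immediately. A 128 x 128 block asks for 16384 threads, far above the 512 limit mentioned here. The sketch below shows one way to make such a failure visible by checking cudaGetLastError() right after the launch. The names dummyKernel and tryLaunch are hypothetical stand-ins, not code from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel so the sketch is self-contained:
// marks every pixel inside the image with 1.
__global__ void dummyKernel(unsigned char* res, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < (unsigned int)width && y < (unsigned int)height)
        res[y * width + x] = 1;
}

// Launches with a square block of the given edge length and reports
// whether the runtime accepted the launch configuration.
bool tryLaunch(unsigned char* devRes, int width, int height, unsigned int edge)
{
    dim3 block(edge, edge, 1);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y, 1);

    dummyKernel<<<grid, block>>>(devRes, width, height);

    // A rejected configuration shows up here; the kernel did not run.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch with %u x %u block failed: %s\n",
               edge, edge, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Called with a device buffer of width * height bytes, tryLaunch(..., 128) should report a failure, while tryLaunch(..., 16) uses 256 threads per block and should be accepted.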