Strange Behavior on image processing

LeRoi · September 7, 2008, 1:01pm

Hi all,

I’m writing a simple edge detection with CUDA, which generally works fine. But I’m facing a problem with the number of blocks and threads per block in the kernel call. If I use a large number of blocks in the call for the first image, I get crappy results; if I use a small number of blocks, like 8, everything works fine, even if I increase the number of blocks for later calls. I don’t know where this behaviour comes from. Maybe someone has an idea. Every hint is appreciated :)

Some Facts:

CUDA 1.1 on Windows XP

GPU: 9500M GS

The CUDA Code:

__global__
void operateRobert(unsigned char* data, unsigned char* res, int width, int height){
    // get indices (position in memory)
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // check that the position is in the valid region
    if(x>0 && y>0 && x<width-1 && y<height-1){
        // var to store calculated value
        int value;

        // calculate value
        value = abs((data[(y*width)+x] - data[((y+1)*width)+(x+1)]) + (data[((y+1)*width)+x] - data[(y*width)+(x+1)]));
        if(value>255){
            value=255;
        }

        // store result
        res[(y*width)+x]=(char)value;
    }
}//operate

// calling the kernel
dim3 block(128, 128, 1);
dim3 grid(width / block.x, height / block.y, 1);

// do calculation
operateRobert<<<grid,block,0>>>(d_d,result,width,height);

Regards LeRoi

cbuchner1 · September 7, 2008, 2:56pm

This probably won’t solve your “crappy results” problem, but in case your image size is not a multiple of the block dimension, rounding up the grid dimensions like this may help. Your bounds check against width and height inside the kernel makes this safe.

// calling the kernel
dim3 block(128, 128, 1);
dim3 grid((width + block.x-1) / block.x, (height + block.y-1) / block.y, 1);

// do calculation
operateRobert<<<grid,block,0>>>(d_d,result,width,height);

Generally, you should consider placing your input data in a texture. Alternatively, consider moving blocks of input data to shared memory first, before running the edge detection filter on it. Right now you have 4 memory reads per pixel, and these go through slow global memory in an uncoalesced manner. In other words, you’re not getting peak bandwidth, and on top of that you’re consuming 4 times the bandwidth your reads actually require.

Christian
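To make the grid rounding concrete, here is a minimal host-side sketch in plain C++ (it needs no GPU). The helper names `floorDiv` and `ceilDiv` and the example sizes (a 1000-pixel-wide image, 16-thread-wide blocks) are assumptions for illustration and do not come from the thread.

```cpp
#include <cassert>

// Hypothetical helper: truncating division, as in "width / block.x".
inline int floorDiv(int size, int block) {
    assert(size >= 0 && block > 0);
    return size / block;
}

// Hypothetical helper: rounding-up division,
// as in "(width + block.x - 1) / block.x".
inline int ceilDiv(int size, int block) {
    assert(size >= 0 && block > 0);
    return (size + block - 1) / block;
}
```

With these assumed sizes, `floorDiv(1000, 16)` gives 62 blocks, which cover only 992 pixels and leave the last 8 columns unprocessed. `ceilDiv(1000, 16)` gives 63 blocks covering 1008 positions, and the kernel's bounds check discards the 8 surplus threads.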

Quoc_Vinh · September 8, 2008, 9:28am

You should pay attention to the maximum number of threads in a block:

“The maximum number of threads per block is 512;”

In your case, “dim3 block(128, 128, 1);” gives 128 x 128 x 1 = 16384 threads per block, which is far more than 512.

To get the maximum number of threads in each block, you can define

dim3 block(32, 16, 1); or dim3 block(512, 1, 1); or dim3 block(1, 512, 1);

Good luck :)
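The arithmetic behind this limit can be checked on the host before launching. The following is a minimal sketch in plain C++. `Dim3` is a hypothetical stand-in for CUDA's `dim3` so that the snippet builds without the toolkit, and `kMaxThreadsPerBlock`, `threadsPerBlock`, and `fitsLimit` are illustrative names, not part of any API.

```cpp
#include <cassert>

// Limit quoted above for this hardware generation (assumed fixed here).
const unsigned int kMaxThreadsPerBlock = 512;

// Hypothetical stand-in for CUDA's dim3.
struct Dim3 { unsigned int x, y, z; };

// Total threads in one block is the product of the three dimensions.
inline unsigned int threadsPerBlock(Dim3 b) {
    assert(b.x > 0 && b.y > 0 && b.z > 0);
    return b.x * b.y * b.z;
}

// True if the block configuration stays within the per-block thread limit.
inline bool fitsLimit(Dim3 b) {
    return threadsPerBlock(b) <= kMaxThreadsPerBlock;
}
```

Under these assumptions, a 128 x 128 x 1 block has 16384 threads and fails the check, while 32 x 16 x 1 has exactly 512 and passes. On a real device the limit can be read from the `maxThreadsPerBlock` field of `cudaDeviceProp` (filled in by `cudaGetDeviceProperties`). A launch that exceeds it fails without running the kernel, which `cudaGetLastError()` reports when called after the launch.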

LeRoi · September 8, 2008, 1:13pm

Thanks for the replies. Quoc Vinh, you’re right: the threads-per-block limit was my mistake, and the program works fine now. My problem now is speed. Before I stuck to the 512 limit, my runs were much faster (but not working properly). Why? Besides the texture and shared memory approaches Christian pointed out, are there other ways to speed up the calculation?

Robert
