Kernel Configuration and Runtime

mazzok · March 4, 2010, 6:40pm

Hello,

I'm thinking about increasing the speed of my kernel.

Currently I'm working with a 256x256 block grid and my results are not very fast.

I was asking myself whether a kernel can get faster if you change the number of blocks and their sizes,
for example 512 blocks of 128 threads.

Might that decrease my runtime?

Thanks for your time!

Maz

Cygnus_X1 · March 4, 2010, 7:30pm

Possibly, but unlikely. Unless it accidentally increases your occupancy a bit, you will have the same number of threads running on each streaming multiprocessor.
If you are unhappy with the speed of your code, most likely it is not parallel enough and requires some algorithmic rethinking. What is it that you want to implement?
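
The point about the thread count per multiprocessor can be checked with a little arithmetic. The sketch below is host-only code and rests on assumptions not stated in this post: the limits of a compute capability 1.3 device (at most 1024 resident threads and 8 resident blocks per multiprocessor), with register and shared memory pressure ignored. The helper name `residentThreadsPerSM` is made up for illustration.

```cuda
#include <algorithm>

// Assumed limits for a compute capability 1.3 device (an assumption, not from the post).
const int kMaxThreadsPerSM = 1024;
const int kMaxBlocksPerSM  = 8;

// Hypothetical helper: number of threads resident on one multiprocessor for a
// given block size, ignoring register and shared memory limits.
int residentThreadsPerSM(int threadsPerBlock)
{
    int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
    int blocks = std::min(kMaxBlocksPerSM, blocksByThreads);
    return blocks * threadsPerBlock;
}
```

Under these assumptions, 256 threads per block gives 4 x 256 = 1024 resident threads and 128 threads per block gives 8 x 128 = 1024, so switching between the two would not by itself put more threads on a multiprocessor.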

mazzok · March 4, 2010, 10:06pm

I'm implementing a 3x3 median filter on a 256 x 256 image.

The grid is launched with <<<256,256>>>,

so one block corresponds to one image line with 256 threads.

Since a 3x3 filter actually uses 3 rows of my original image to compute one row of the resulting image, I'm putting those 3 lines into shared memory at the beginning of each new block.

Strangely, this version is slower than the one where all calculations are done directly on global memory, which seems odd to me.

I just had the idea that using smaller blocks might increase the speed?!

Cygnus_X1 · March 5, 2010, 3:29am

I would use shared memory too, and it should be faster than reading data from global memory 9 times for each thread. If you don’t do anything else, there should be enough shared memory for you to do that without hampering the occupancy (assuming a compute capability 1.3 device). There shouldn’t be any bank conflicts either. So I don’t know; maybe you could post your code?
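
A rough check of the shared memory claim, as host-only arithmetic. It assumes 16 KB of shared memory and at most 8 resident blocks per multiprocessor (the compute capability 1.3 figures), 8-bit pixels, and four rows of 256 pixels per block (three input rows plus one result row). The helper names are hypothetical.

```cuda
// Assumed figures for a compute capability 1.3 device.
const int kSharedBytesPerSM = 16 * 1024;
const int kBlockLimitPerSM  = 8;

// Hypothetical helper: shared memory one block needs for `rows` image rows
// of `width` 8-bit pixels.
int sharedBytesPerBlock(int rows, int width)
{
    return rows * width;
}

// Blocks per multiprocessor that shared memory alone would allow.
int blocksAllowedBySharedMemory(int bytesPerBlock)
{
    return kSharedBytesPerSM / bytesPerBlock;
}
```

Four rows of 256 bytes come to 1 KB per block, so shared memory alone would allow 16 blocks per multiprocessor, which is above the assumed 8-block limit and therefore not the limiting factor.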

mazzok · March 5, 2010, 1:13pm

This function is called with threadIdx.x == 0.

It fills 3 rows into shared memory.

[codebox]__device__ void fillMaskData(unsigned char *origData, unsigned char *sDataMask, int type, int Width){
	int start = blockIdx.x * Width;
	int end = start + 3*Width;
	for(int i = start, j = 0; i < end; i++, j++){
		sDataMask[j] = origData[i];
		__syncthreads();
	}
}[/codebox]

This function writes from shared memory to global memory; it is called at the end by the last thread in each block.

[codebox]__device__ void fillwithRowData(unsigned char *datatowrite, unsigned char *sDataMask, int offsetrows, int Width){
	int start = (blockIdx.x+offsetrows)*Width;
	for(int i = 1; i < Width-1; i++){
		datatowrite[start+i] = sDataMask[i];
		__syncthreads();
	}
}[/codebox]

This is my kernel, launched as 254x254, which leaves a margin of 1 pixel on each side.

[codebox]__global__ void doMedianFilter3x3(unsigned char *data_d_orig, unsigned char *data_d, int Height, int Width){
	extern __shared__ unsigned char sData[];  // Placeholder for 4 rows -> 3 mask, 1 row calculated
	unsigned char *sDataMask = (unsigned char *)sData;
	unsigned char *RowData = (unsigned char *)&sDataMask[3*Width];
	unsigned char medianValues[9];

	// Fill shared Data shared Mask at beginning of each block
	if(threadIdx.x == 0)
		fillMaskData(data_d_orig, sDataMask, 3, Width);

	// Get the Neighborhood
	int idx = (blockIdx.x + 1) * Width + 1 + threadIdx.x;
	int sidx = Width + 1 + threadIdx.x;  // current pixel -> +1 because of a margin
	int pix00 = threadIdx.x; // Upper left
	int pix10 = 1 + threadIdx.x; // Upper middle
	int pix20 = 2 + threadIdx.x; // Upper right
	int pix01 = Width + threadIdx.x; // left
	int pix21 = Width + 2 + threadIdx.x; // right
	int pix02 = 2 * Width + threadIdx.x; // lower left
	int pix12 = 2 * Width + 1 + threadIdx.x; // lower middle
	int pix22 = 2 * Width + 2 + threadIdx.x; // lower Right

	medianValues[0] = sDataMask[pix00];
	medianValues[1] = sDataMask[pix10];
	medianValues[2] = sDataMask[pix20];
	medianValues[3] = sDataMask[pix01];
	medianValues[4] = sDataMask[sidx];
	medianValues[5] = sDataMask[pix21];
	medianValues[6] = sDataMask[pix02];
	medianValues[7] = sDataMask[pix12];
	medianValues[8] = sDataMask[pix22];

	unsigned char MedianValue = getMedianValue(medianValues, 9);
	RowData[threadIdx.x+1] = MedianValue;
	//__syncthreads();
	if(threadIdx.x == Width-2-1)
		fillwithRowData(data_d, RowData, 1, Width);
}[/codebox]

Do you think there is something wrong with my code?

Regards

maz

StickGuy · March 5, 2010, 3:28pm

You should fill your shared memory in parallel if possible.

mazzok · March 6, 2010, 2:39pm

So you mean each thread should read the portion of the original data it needs into its part of shared memory?
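
A minimal sketch of what a parallel fill could look like, not the original poster's code. Every thread in the block copies part of the three input rows into shared memory, the block synchronizes once, and then each thread computes one output pixel and writes it straight to global memory. The names `median3x3Rows`, `median9`, and `median3x3` are made up for illustration. `median9` is a plain insertion sort standing in for the `getMedianValue` that was not posted, and the block size of 256, the copied border pixels, and a height of at least 3 are assumptions.

```cuda
#include <cuda_runtime.h>

// Stand-in for the unposted getMedianValue: insertion sort, then middle element.
__device__ unsigned char median9(unsigned char v[9])
{
    for (int i = 1; i < 9; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    return v[4];
}

// One block per interior output row; launch with height - 2 blocks and
// 3 * width bytes of dynamic shared memory.
__global__ void median3x3Rows(const unsigned char *in, unsigned char *out,
                              int width, int height)
{
    extern __shared__ unsigned char rows[];   // three input rows, back to back
    int y = blockIdx.x + 1;                   // output row handled by this block

    // Parallel fill: consecutive threads read consecutive bytes of each row.
    for (int x = threadIdx.x; x < width; x += blockDim.x) {
        rows[x]             = in[(y - 1) * width + x];
        rows[width + x]     = in[y * width + x];
        rows[2 * width + x] = in[(y + 1) * width + x];
    }
    __syncthreads();                          // outside any branch: all threads reach it

    // Each thread handles interior pixels only (columns 1 .. width - 2).
    for (int x = threadIdx.x + 1; x < width - 1; x += blockDim.x) {
        unsigned char v[9];
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                v[3 * r + c] = rows[r * width + x - 1 + c];
        out[y * width + x] = median9(v);
    }
}

// Hypothetical host helper with minimal error handling. Assumes height >= 3.
// Border pixels are copied from the input unchanged.
cudaError_t median3x3(const unsigned char *hostIn, unsigned char *hostOut,
                      int width, int height)
{
    size_t bytes = (size_t)width * height;
    unsigned char *dIn = 0, *dOut = 0;
    cudaError_t err = cudaMalloc((void **)&dIn, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dOut, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dOut, dIn, bytes, cudaMemcpyDeviceToDevice);
    if (err == cudaSuccess) {
        median3x3Rows<<<height - 2, 256, 3 * width>>>(dIn, dOut, width, height);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

In this sketch each thread issues three global reads instead of nine, no single thread does the whole copy, and the result row never passes through shared memory. Whether it is actually faster on a given card would still need to be measured.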
