Kernel Configuration and Runtime


I'm thinking about increasing the speed of my kernel.

Currently I'm working with a grid of 256 blocks of 256 threads each, and my results are not very fast.

I was asking myself whether a kernel can get faster if you change the number of blocks and their sizes,
for example 512 blocks of 128 threads each.

Might that decrease my runtime?

Thanks for your time!


Possibly, but unlikely. Unless it accidentally increases your occupancy a bit, you will have the same number of threads running on each streaming multiprocessor.
If you are unhappy with the speed of your code, most likely it is not parallel enough and requires some algorithmic rethinking. What is it that you want to implement?
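
The simplest way to settle a question like this is to time both launch configurations. Below is a minimal sketch, assuming a kernel that derives its work item from the global thread index so that it gives the same result for any block size; the `invert` kernel and the `timeLaunch` helper are hypothetical stand-ins, not the filter discussed in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one thread per pixel, indexed globally,
// so the result does not depend on how the grid is split into blocks.
__global__ void invert(const unsigned char *in, unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Average time in milliseconds of one launch, measured with CUDA events.
static float timeLaunch(int blocks, int threads,
                        const unsigned char *in, unsigned char *out, int n) {
    const int reps = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    invert<<<blocks, threads>>>(in, out, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        invert<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until all launches finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main() {
    const int n = 256 * 256;                   // 256 x 256 image, 1 byte per pixel
    unsigned char *in, *out;
    cudaMalloc(&in, n);
    cudaMalloc(&out, n);
    cudaMemset(in, 7, n);

    // Both configurations launch the same total number of threads (65536).
    printf("<<<256,256>>>: %f ms\n", timeLaunch(256, 256, in, out, n));
    printf("<<<512,128>>>: %f ms\n", timeLaunch(512, 128, in, out, n));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A kernel that ties one block to one image row would first have to be rewritten to use a global index before the two configurations can be compared like this.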

I'm implementing a 3x3 median filter on a 256 x 256 image.

The grid is launched with <<<256,256>>>,

so one block corresponds to one image line, with 256 threads.

Since a 3x3 filter actually uses 3 rows of my original image to compute one row of the resulting image, I'm putting those 3 lines into shared memory at the beginning of each new block.

Strangely, this version is slower than the one where all calculations are done directly on global memory, which surprises me.

I just had the idea that using smaller blocks might increase the speed?!

I would use shared memory too, and it should be faster than reading data from global memory 9 times per thread. If you don't do anything else, there should be enough shared memory to do that without hampering occupancy (assuming a compute capability 1.3 device). There shouldn't be any bank conflicts either. So I don't know; maybe you could post your code?
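
To put numbers on the occupancy remark: four rows of 256 one-byte pixels need 1 KB of shared memory per block, and a compute capability 1.3 multiprocessor offers 16 KB of shared memory and at most 1024 resident threads, so four 256-thread blocks would use only 4 KB. On newer toolkits the runtime can report this directly. The sketch below assumes a toolkit that provides `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; `rowKernel` is a hypothetical stand-in, and the real filter may be limited by registers instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel that uses dynamic shared memory the way
// the filter would; its register usage will differ from the real kernel.
__global__ void rowKernel(const unsigned char *in, unsigned char *out, int width) {
    extern __shared__ unsigned char rows[];
    int x = threadIdx.x;
    rows[x] = in[blockIdx.x * width + x];
    __syncthreads();
    out[blockIdx.x * width + x] = rows[x];
}

int main() {
    const int threadsPerBlock = 256;
    const size_t smemPerBlock = 4 * 256;   // 3 input rows + 1 result row, 1 byte per pixel

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // device 0 is an assumption

    // Ask the runtime how many such blocks fit on one multiprocessor.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, rowKernel, threadsPerBlock, smemPerBlock);

    printf("device: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("shared memory requested per block: %zu bytes\n", smemPerBlock);
    printf("resident blocks per multiprocessor: %d\n", blocksPerSM);
    printf("occupancy: %.0f%%\n",
           100.0 * blocksPerSM * threadsPerBlock / prop.maxThreadsPerMultiProcessor);
    return 0;
}
```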

This function is called by the thread with threadIdx.x == 0.

It fills 3 rows into shared memory:

[codebox]__device__ void fillMaskData(unsigned char *origData, unsigned char *sDataMask, int type, int Width){
	int start = blockIdx.x * Width;
	int end = start + 3*Width;
	for(int i = start, j = 0; i < end; i++, j++){
		sDataMask[j] = origData[i];
	}
}[/codebox]


This function writes from shared memory to global memory; the last thread of each block calls it at the end:

[codebox]__device__ void fillwithRowData(unsigned char *datatowrite, unsigned char *sDataMask, int offsetrows, int Width){
	int start = (blockIdx.x+offsetrows)*Width;
	for(int i = 1; i < Width-1; i++){
		datatowrite[start+i] = sDataMask[i];
	}
}[/codebox]



This is my kernel, launched with 254 blocks of 254 threads, which leaves a margin of 1 pixel on each side:

[codebox]__global__ void doMedianFilter3x3(unsigned char *data_d_orig, unsigned char *data_d, int Height, int Width){
	extern __shared__ unsigned char sData[];  // Placeholder for 4 rows -> 3 mask, 1 row calculated
	unsigned char *sDataMask = (unsigned char *)sData;
	unsigned char *RowData = (unsigned char *)&sDataMask[3*Width];
	unsigned char medianValues[9];

	// Fill shared mask data at the beginning of each block
	if(threadIdx.x == 0){
		// fillMaskData(...) is called here; the call itself is missing from the post
	}

	// Get the neighborhood
	int idx = (blockIdx.x + 1) * Width + 1 + threadIdx.x;
	int sidx = Width + 1 + threadIdx.x;       // current pixel -> +1 because of a margin
	int pix00 = threadIdx.x;                  // upper left
	int pix10 = 1 + threadIdx.x;              // upper middle
	int pix20 = 2 + threadIdx.x;              // upper right
	int pix01 = Width + threadIdx.x;          // left
	int pix21 = Width + 2 + threadIdx.x;      // right
	int pix02 = 2 * Width + threadIdx.x;      // lower left
	int pix12 = 2 * Width + 1 + threadIdx.x;  // lower middle
	int pix22 = 2 * Width + 2 + threadIdx.x;  // lower right

	medianValues[0] = sDataMask[pix00];
	medianValues[1] = sDataMask[pix10];
	medianValues[2] = sDataMask[pix20];
	medianValues[3] = sDataMask[pix01];
	medianValues[4] = sDataMask[sidx];
	medianValues[5] = sDataMask[pix21];
	medianValues[6] = sDataMask[pix02];
	medianValues[7] = sDataMask[pix12];
	medianValues[8] = sDataMask[pix22];

	unsigned char MedianValue = getMedianValue(medianValues,9);
	RowData[threadIdx.x+1] = MedianValue;

	if(threadIdx.x == Width-2-1){
		// fillwithRowData(...) is called here; the call itself is missing from the post
	}
}[/codebox]


Do you think there is something wrong with my code?



You should fill your shared memory in parallel if possible.
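
A minimal sketch of what a parallel fill could look like, assuming one block per interior row and one thread per column (so `blockDim.x` equals the image width, which must not exceed the device's block size limit). Every thread copies its own column of the three input rows, the block synchronizes, and each interior thread then writes its result straight to global memory. The names `median3x3Parallel` and `median9` are hypothetical; `median9` only stands in for the `getMedianValue` function, which is not shown in this thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: insertion-sorts 9 values in place, returns the middle one.
__host__ __device__ unsigned char median9(unsigned char v[9]) {
    for (int i = 1; i < 9; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    return v[4];
}

// One block per interior row; blockDim.x must equal width.
__global__ void median3x3Parallel(const unsigned char *in, unsigned char *out, int width) {
    extern __shared__ unsigned char rows[];    // 3 * width bytes
    int x = threadIdx.x;
    int y = blockIdx.x + 1;                    // centre row handled by this block

    // Parallel fill: each thread loads one pixel from each of the 3 rows.
    for (int r = 0; r < 3; ++r)
        rows[r * width + x] = in[(y - 1 + r) * width + x];
    __syncthreads();                           // all loads must finish before any read

    if (x >= 1 && x < width - 1) {             // skip the 1-pixel margin
        unsigned char v[9];
        int k = 0;
        for (int r = 0; r < 3; ++r)
            for (int c = -1; c <= 1; ++c)
                v[k++] = rows[r * width + x + c];
        out[y * width + x] = median9(v);       // each thread writes its own pixel
    }
}

int main() {
    const int W = 256, H = 256, N = W * H;
    unsigned char *h_in = (unsigned char *)malloc(N);
    unsigned char *h_out = (unsigned char *)malloc(N);
    for (int i = 0; i < N; ++i)                // arbitrary test pattern
        h_in[i] = (unsigned char)((i * 37 + (i / W) * 11) % 251);

    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, N);
    cudaMalloc(&d_out, N);
    cudaMemcpy(d_in, h_in, N, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, N);

    median3x3Parallel<<<H - 2, W, 3 * W>>>(d_in, d_out, W);
    cudaMemcpy(h_out, d_out, N, cudaMemcpyDeviceToHost);

    // Compare the interior pixels against a CPU reference.
    int mismatches = 0;
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x) {
            unsigned char v[9];
            int k = 0;
            for (int r = -1; r <= 1; ++r)
                for (int c = -1; c <= 1; ++c)
                    v[k++] = h_in[(y + r) * W + x + c];
            if (median9(v) != h_out[y * W + x]) ++mismatches;
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```

With this layout no single thread does a serial copy while the others wait, and neighbouring threads read neighbouring bytes from global memory.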

So you mean each thread should read the portion of the original data it needs into its own part of shared memory?