Is this already the most optimized version?

Dear all:

I wrote the code below to do convolution filtering (the filter looks like [-1,0,0,0,0,1]),

but I found that most of the time is spent reading the data from global memory into shared memory,

and it is much slower than doing the filtering through texture memory.

Is this already the most optimized way to do it with shared memory?

Or, for this kind of filter, is texture memory better than shared memory?

__global__ void CPLBi(unsigned short* S,char* D,int devImgSizeX,int devImgSizeY,int Pitch,float th)
{
	__shared__ float share[BLOCK_SIZESglLRX][BLOCK_SIZESglLRY];

	unsigned int xIndex,yIndex,index_in;

	// Neighbouring blocks overlap by Pitch columns, so each tile also holds its left neighbours
	xIndex = blockIdx.x * BLOCK_SIZESglLRX + threadIdx.x-blockIdx.x*Pitch;
	yIndex = blockIdx.y * BLOCK_SIZESglLRY + threadIdx.y;
	index_in = yIndex * devImgSizeX + xIndex;

	// Stage one pixel per thread in shared memory
	if (xIndex<devImgSizeX && yIndex<devImgSizeY){
		share[threadIdx.x][threadIdx.y] = *(S+index_in);
	}

	// Every thread must reach this barrier before any thread reads a neighbour's value
	__syncthreads();

	if (xIndex>(Pitch-1) && xIndex<devImgSizeX && yIndex<devImgSizeY){
		if (threadIdx.x>(Pitch-1) && ((share[threadIdx.x][threadIdx.y]-share[threadIdx.x-Pitch][threadIdx.y])>th)){
			// ...
		}
		else if (threadIdx.x>(Pitch-1) && ((share[threadIdx.x][threadIdx.y]-share[threadIdx.x-Pitch][threadIdx.y])<=th)){
			// ...
		}
	}
	else if (xIndex<Pitch && yIndex<devImgSizeY){
		// ...
	}
}

You may want to check whether your memory operations are coalesced. If you are running on a 1.1 device (G92 or older), it’s most likely that your memory operations (both loads and stores) are not coalesced. Chapter 5.1.2 of the Programming Guide has a detailed explanation of the conditions for coalesced access.
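To make the word-size part of the coalescing condition concrete: on 1.0/1.1 devices a half-warp's accesses coalesce only when each thread touches a 4-, 8-, or 16-byte word, so a plain `unsigned short` load per thread does not qualify. One possible workaround, sketched below, is to let each thread read two pixels as a single `ushort2`. The names `loadRowPairs` and `convertOnDevice` are hypothetical, and the sketch assumes an even image width and buffers that come straight from `cudaMalloc`; error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: converts 16-bit pixels to float, two pixels per thread.
// Assumes width is even and src/dst come from cudaMalloc, so every row
// starts on a 4-byte (src) and 8-byte (dst) boundary.
__global__ void loadRowPairs(const unsigned short* src, float* dst,
                             int width, int height)
{
    int pairsPerRow = width / 2;
    int pairIdx = blockIdx.x * blockDim.x + threadIdx.x;  // column, in 2-pixel units
    int y = blockIdx.y;                                   // one block row per image row

    if (pairIdx < pairsPerRow && y < height) {
        const ushort2* srcRow =
            reinterpret_cast<const ushort2*>(src + (size_t)y * width);
        float2* dstRow = reinterpret_cast<float2*>(dst + (size_t)y * width);

        ushort2 p = srcRow[pairIdx];                            // one 32-bit load
        dstRow[pairIdx] = make_float2((float)p.x, (float)p.y);  // one 64-bit store
    }
}

// Hypothetical host helper that runs the kernel on a tightly packed image.
std::vector<float> convertOnDevice(const std::vector<unsigned short>& img,
                                   int width, int height)
{
    size_t n = (size_t)width * height;
    unsigned short* dSrc = 0;
    float* dDst = 0;
    cudaMalloc((void**)&dSrc, n * sizeof(unsigned short));
    cudaMalloc((void**)&dDst, n * sizeof(float));
    cudaMemcpy(dSrc, img.data(), n * sizeof(unsigned short),
               cudaMemcpyHostToDevice);

    dim3 block(64, 1);
    dim3 grid((width / 2 + block.x - 1) / block.x, height);
    loadRowPairs<<<grid, block>>>(dSrc, dDst, width, height);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dDst, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

Whether this helps in a given kernel still depends on the alignment of each row, so it is worth confirming with the profiler rather than assuming.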

You may also want to check whether you have bank conflicts in your shared memory. For example, if BLOCK_SIZESglLRY is a multiple of the number of shared memory banks (16 on 1.x devices), bank conflicts may occur, because threads with consecutive threadIdx.x then map to the same bank.
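In the posted kernel the tile is indexed as `share[threadIdx.x][threadIdx.y]`, so threads with consecutive `threadIdx.x` are BLOCK_SIZESglLRY words apart. A common way around this is to make `threadIdx.x` the last (fastest-varying) index. Here is a minimal sketch of that layout for a simplified version of the filter; it is not the original kernel. The names `diffFilter`, `runDiffFilter`, `TILE_X`, `TILE_Y` and `OFFSET` are hypothetical, the input is assumed to be `float` already, the thresholding step is omitted, and writing 0 for the first `OFFSET` columns is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_X 64   // hypothetical tile width (threads per row of the block)
#define TILE_Y 4    // hypothetical tile height
#define OFFSET 5    // distance between the -1 and +1 taps of [-1,0,0,0,0,1]

// Sketch: dst(x, y) = src(x, y) - src(x - OFFSET, y), with 0 where x < OFFSET.
__global__ void diffFilter(const float* src, float* dst, int width, int height)
{
    // Row index first, column index last: threads with consecutive
    // threadIdx.x touch consecutive 32-bit words, i.e. different banks.
    __shared__ float tile[TILE_Y][TILE_X + OFFSET];

    int x = blockIdx.x * TILE_X + threadIdx.x;
    int y = blockIdx.y * TILE_Y + threadIdx.y;

    if (y < height) {
        if (x < width)
            tile[threadIdx.y][threadIdx.x + OFFSET] = src[y * width + x];
        // The first OFFSET threads of each row also fetch the left halo.
        if (threadIdx.x < OFFSET) {
            int hx = x - OFFSET;
            tile[threadIdx.y][threadIdx.x] =
                (hx >= 0 && hx < width) ? src[y * width + hx] : 0.0f;
        }
    }
    __syncthreads();  // reached by every thread, outside any branch

    if (x < width && y < height) {
        float d = tile[threadIdx.y][threadIdx.x + OFFSET]
                - tile[threadIdx.y][threadIdx.x];
        dst[y * width + x] = (x >= OFFSET) ? d : 0.0f;
    }
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<float> runDiffFilter(const std::vector<float>& img,
                                 int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float* dSrc = 0;
    float* dDst = 0;
    cudaMalloc((void**)&dSrc, bytes);
    cudaMalloc((void**)&dDst, bytes);
    cudaMemcpy(dSrc, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_X, TILE_Y);
    dim3 grid((width + TILE_X - 1) / TILE_X, (height + TILE_Y - 1) / TILE_Y);
    diffFilter<<<grid, block>>>(dSrc, dDst, width, height);

    std::vector<float> out((size_t)width * height);
    cudaMemcpy(out.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

Because `TILE_X` is a multiple of 16, every half-warp stays inside one tile row, so both the shared memory writes and the two reads per thread hit distinct banks.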

Dear pc chen:

Thanks for your reply. I found out what happened.

My image size in x is 4872, which is not a multiple of my block size in x (96).

Also, on a 1.1 device, my image data type is unsigned short, which is not 32 bits wide.

After I changed the image size to 4096 and the block size in x to 128,

all of the loads are coalesced, and the speedup is about 2.5x compared to the original version.

Thanks a lot.
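If changing the image size itself is not an option, an alternative that was not tried in this thread is to keep all 4872 columns and pad each row on the device up to the next multiple of the block width. A rough sketch follows; the helper names `roundUpToMultiple` and `uploadPadded` are made up for illustration, and the padding columns are simply zero-filled.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: round width up to the next multiple of blockWidth.
inline int roundUpToMultiple(int width, int blockWidth)
{
    return ((width + blockWidth - 1) / blockWidth) * blockWidth;
}

// Hypothetical helper: copy a tightly packed host image into a device buffer
// whose rows are padded to a multiple of blockWidth pixels.
// Returns the device pointer (0 on allocation failure) and the padded width.
unsigned short* uploadPadded(const unsigned short* hostImg,
                             int width, int height, int blockWidth,
                             int* paddedWidthOut)
{
    int paddedWidth = roundUpToMultiple(width, blockWidth);
    size_t srcRowBytes = (size_t)width * sizeof(unsigned short);
    size_t dstRowBytes = (size_t)paddedWidth * sizeof(unsigned short);

    unsigned short* dImg = 0;
    if (cudaMalloc((void**)&dImg, dstRowBytes * height) != cudaSuccess)
        return 0;

    cudaMemset(dImg, 0, dstRowBytes * height);  // zero the padding columns
    cudaMemcpy2D(dImg, dstRowBytes,             // destination and its row pitch
                 hostImg, srcRowBytes,          // source and its row pitch
                 srcRowBytes, height,           // bytes per row to copy, rows
                 cudaMemcpyHostToDevice);

    *paddedWidthOut = paddedWidth;
    return dImg;
}
```

With a width of 4872 and a block width of 128, the padded width comes out as 4992, and the kernel would then index rows with the padded width instead of the image width. `cudaMallocPitch` is the runtime's own way of getting padded rows, although the pitch it picks is not necessarily a multiple of the block width.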