Is this correct for mono image processing?

Dear all:

In my tests, even when all loads and stores to global memory are coalesced and I stage the data through shared memory,

the transfer time is still there (because I have to call __syncthreads() after the read, as in the code below).

__shared__ float share[BLOCK_SIZESglLRX][BLOCK_SIZESglLRY];

	unsigned int xIndex, yIndex, index_in;

	// Global pixel coordinates handled by this thread
	xIndex = blockIdx.x * BLOCK_SIZESglLRX + threadIdx.x;
	yIndex = blockIdx.y * BLOCK_SIZESglLRY + threadIdx.y;
	index_in = yIndex * devImgSizeX + xIndex;

	// Only threads inside the image load a pixel into the tile
	if (xIndex < devImgSizeX && yIndex < devImgSizeY) {
		share[threadIdx.x][threadIdx.y] = S[index_in];
	}

	// Wait until the whole tile is loaded before any thread reads it
	__syncthreads();

If I want to do a 2D convolution (non-separable), such as a smoothing or Laplacian filter, I think using texture memory will be faster than using shared memory.

Is it right?

I’m not sure whether it is faster, but I know that shared memory is very fast. Are you interested in computer vision? Do you know where I can find information about it and CUDA?

Try it and find out!
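
For anyone who wants to try it, here is a minimal sketch of the shared-memory side of the comparison: a non-separable 3x3 convolution that loads each tile plus a one-pixel halo into shared memory, then checks the result against a plain CPU loop. Everything in it is an assumption for illustration and not taken from the code above: the names conv2dShared, maxErrorVsCpu and d_mask, the 16x16 tile size, the clamp-to-edge border handling, and the synthetic test image. The tile is indexed [y][x] so that consecutive threadIdx.x values touch consecutive shared-memory words.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

#define TILE 16                 // threads per block edge (assumed value)
#define R 1                     // filter radius; 1 gives a 3x3 mask
#define MASK_W (2 * R + 1)

__constant__ float d_mask[MASK_W * MASK_W];   // filter coefficients

__global__ void conv2dShared(const float* src, float* dst, int w, int h)
{
    // Output tile plus a halo of R pixels on every side.
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

    int x0 = blockIdx.x * TILE - R;   // image coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - R;

    // Cooperative load: some threads fetch more than one element
    // because the tile with halo is larger than the thread block.
    for (int ty = threadIdx.y; ty < TILE + 2 * R; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2 * R; tx += TILE) {
            int x = min(max(x0 + tx, 0), w - 1);   // clamp to the border
            int y = min(max(y0 + ty, 0), h - 1);
            tile[ty][tx] = src[y * w + x];
        }
    __syncthreads();   // reached by every thread: no early return above

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            sum += d_mask[(dy + R) * MASK_W + (dx + R)]
                 * tile[threadIdx.y + R + dy][threadIdx.x + R + dx];
    dst[y * w + x] = sum;
}

// Runs the kernel on a synthetic w x h image with a Laplacian mask and
// returns the largest absolute difference from a CPU reference.
float maxErrorVsCpu(int w, int h)
{
    assert(w > 0 && h > 0);
    const float lap[MASK_W * MASK_W] = {0, 1, 0, 1, -4, 1, 0, 1, 0};
    std::vector<float> img(w * h), out(w * h);
    for (int i = 0; i < w * h; ++i) img[i] = float((i * 7) % 13);

    float *dSrc = nullptr, *dDst = nullptr;
    size_t bytes = size_t(w) * h * sizeof(float);
    cudaMalloc(&dSrc, bytes);
    cudaMalloc(&dDst, bytes);
    cudaMemcpy(dSrc, img.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_mask, lap, sizeof(lap));

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    conv2dShared<<<grid, block>>>(dSrc, dDst, w, h);
    cudaMemcpy(out.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);

    float maxErr = 0.0f;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float ref = 0.0f;
            for (int dy = -R; dy <= R; ++dy)
                for (int dx = -R; dx <= R; ++dx) {
                    int xx = std::min(std::max(x + dx, 0), w - 1);
                    int yy = std::min(std::max(y + dy, 0), h - 1);
                    ref += lap[(dy + R) * MASK_W + (dx + R)] * img[yy * w + xx];
                }
            maxErr = std::max(maxErr, std::fabs(ref - out[y * w + x]));
        }
    return maxErr;
}
```

Calling maxErrorVsCpu(37, 29) from main() exercises image sizes that are not a multiple of the tile size. To compare against textures, a second kernel that reads its neighbours through a texture fetch could be timed with cudaEvent timers on the same image; which one wins probably depends on the GPU and the filter radius, so measuring is the safest answer.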