How to make data transfer fastest

I’d like to do a convolution on an unsigned 16-bit grayscale image,
but reading it is inefficient because the data is not 32-bit.
So I tried mapping the data to a float array,
but my tests show that this is still not a good method.
Is there any better way to make reading and writing it more efficient?

My GPU is a 9800 GTX,
which is not optimized for 16-bit reads and writes.

I have now changed the code to use texture memory access, and it’s better. But I’d like to know whether there is any way to make shared memory faster than texture for this requirement.

The filter is something like a 5x5 average.

If your data is a flat array, it’s probably not too hard.
Read the data from global memory using one 32-bit word per thread, so that each read fetches two 16-bit pixels. Then unpack that word into two 32-bit floats, and write them to shared memory. Do all your ops in shared memory, then reverse the process to combine two values into one word for writing back to global memory.
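
The steps above could look roughly like the sketch below. It is only an illustration under assumptions, not tested on a 9800 GTX: the names processPacked, halvePixels and BLOCK_SIZE are made up for the example, the "op" is a placeholder that halves each pixel, the pixel count is assumed to be even, the packing assumes little-endian order (first pixel in the low 16 bits), and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // threads per block (assumed value)

// Each thread loads one 32-bit word (two packed 16-bit pixels), unpacks it
// into two floats in shared memory, applies a placeholder op, then repacks.
__global__ void processPacked(const unsigned int* in, unsigned int* out, int numWords)
{
    __shared__ float tile[2 * BLOCK_SIZE];

    int w = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (w < numWords) {
        unsigned int word = in[w];                            // one 32-bit read per thread
        tile[2 * threadIdx.x]     = (float)(word & 0xFFFFu);  // first pixel (low half)
        tile[2 * threadIdx.x + 1] = (float)(word >> 16);      // second pixel (high half)
    }
    __syncthreads();  // a real filter would read neighbours' pixels after this point

    if (w < numWords) {
        // Placeholder op (hypothetical): halve each pixel.
        float a = tile[2 * threadIdx.x] * 0.5f;
        float b = tile[2 * threadIdx.x + 1] * 0.5f;
        unsigned int lo = (unsigned int)a;                    // float -> int truncates
        unsigned int hi = (unsigned int)b;
        out[w] = lo | (hi << 16);                             // one 32-bit write per thread
    }
}

// Hypothetical host helper: copies the pixels to the device, runs the kernel,
// and copies the result back. Assumes pixels.size() is even.
std::vector<unsigned short> halvePixels(const std::vector<unsigned short>& pixels)
{
    std::vector<unsigned short> result(pixels.size());
    int numWords = (int)(pixels.size() / 2);
    if (numWords == 0) return result;

    size_t bytes = (size_t)numWords * sizeof(unsigned int);
    unsigned int* dIn = 0;
    unsigned int* dOut = 0;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, pixels.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (numWords + BLOCK_SIZE - 1) / BLOCK_SIZE;
    processPacked<<<blocks, BLOCK_SIZE>>>(dIn, dOut, numWords);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling halvePixels on a std::vector<unsigned short> returns a vector of the same size with each value halved. The point of the layout is that consecutive threads touch consecutive 32-bit words in both the read and the write, and the 16-bit handling only happens in registers and shared memory.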

If you’re doing something like filtering, you’ll want to load it into shared memory regardless for fast access by all your block’s threads.
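
For the 5x5 average mentioned earlier, a shared-memory version might be sketched as below. Again this is a sketch under assumptions rather than a tuned implementation: boxFilter5x5, boxFilter, TILE, RADIUS and SMEM are names invented for the example, image borders are handled by clamping to the edge (the thread does not say which border policy is wanted), the 16-bit loads are kept simple instead of being packed into 32-bit reads, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE   16                   // output pixels per block side (assumed value)
#define RADIUS 2                    // 5x5 window
#define SMEM   (TILE + 2 * RADIUS)  // tile side including the halo

// Each block stages a (TILE+4) x (TILE+4) patch in shared memory as floats,
// then every thread averages the 5x5 window around its own pixel.
__global__ void boxFilter5x5(const unsigned short* src, float* dst, int width, int height)
{
    __shared__ float tile[SMEM][SMEM];

    int x0 = blockIdx.x * TILE - RADIUS;  // top-left corner of the patch, halo included
    int y0 = blockIdx.y * TILE - RADIUS;

    // Cooperative load: the TILE*TILE threads fill all SMEM*SMEM entries.
    for (int i = threadIdx.y * TILE + threadIdx.x; i < SMEM * SMEM; i += TILE * TILE) {
        int tx = i % SMEM;
        int ty = i / SMEM;
        int gx = min(max(x0 + tx, 0), width - 1);   // clamp to edge (assumed border policy)
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = (float)src[gy * width + gx];
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        float sum = 0.0f;
        for (int dy = 0; dy <= 2 * RADIUS; ++dy)
            for (int dx = 0; dx <= 2 * RADIUS; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        dst[y * width + x] = sum / 25.0f;
    }
}

// Hypothetical host helper. Assumes img.size() == width * height and a non-empty image.
std::vector<float> boxFilter(const std::vector<unsigned short>& img, int width, int height)
{
    std::vector<float> out((size_t)width * height);
    unsigned short* dSrc = 0;
    float* dDst = 0;
    cudaMalloc((void**)&dSrc, img.size() * sizeof(unsigned short));
    cudaMalloc((void**)&dDst, out.size() * sizeof(float));
    cudaMemcpy(dSrc, img.data(), img.size() * sizeof(unsigned short), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    boxFilter5x5<<<grid, block>>>(dSrc, dDst, width, height);

    cudaMemcpy(out.data(), dDst, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

With this layout each source pixel is fetched from global memory about once per block (plus the halo), and the 25 reads per output pixel all hit shared memory. Whether that beats the texture path on a given card is something to measure rather than assume.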

Hi, below is my code, but it’s quite slow. Does anyone know the root cause, and how to improve it?

__global__ void DataMapUShtToFltNew(unsigned short* S,float* D,int devImgSizeX,int devImgSizeY)
{
	// 32-bit words; unsigned int is used because long is 64 bits on some platforms
	__shared__ unsigned int sharelong[BLOCK_SIZEX];
	unsigned int xIndex1;
	unsigned int xIndex2;
	unsigned int *longpntr=(unsigned int *)S;

	// Each thread handles two adjacent pixels
	xIndex1 = 2*(blockIdx.x * BLOCK_SIZEX + threadIdx.x);
	xIndex2 = xIndex1+1;

	if (xIndex1<devImgSizeX*devImgSizeY && xIndex2<devImgSizeX*devImgSizeY){
		// Read 2 shorts at a time
		sharelong[threadIdx.x] = *(longpntr+(xIndex1/2));
		// Low 16 bits hold the first pixel, high 16 bits the second
		*(D+xIndex1)=0xFFFF & (sharelong[threadIdx.x]>>(0));
		*(D+xIndex2)=0xFFFF & (sharelong[threadIdx.x]>>(16));