Image flip


I wrote a kernel to flip an image. My idea is to read a block into shared memory, flip it there, and store the block at the equivalent (mirrored) position in the image.

If I benchmark this kernel against a 2 GHz CPU, with

blockDim (16,16) and gridDim (64, 48)

I only get a 5x speedup on a GTX 260 and a 1.5x speedup on a 9800 GT.

Perhaps this kernel is bound by memory speed, or I am doing something wrong.

Do you have any hints for me?

__global__ void kernel (unsigned char* pictureIn, unsigned char* result)
{
	__shared__ unsigned char sBlockA[16][16];   // 1. shared memory for reverse swap
	__shared__ unsigned char sBlockB[16][16];   // 2. shared memory for reverse swap

	int tx = threadIdx.x;   // thread index X-Dir
	int ty = threadIdx.y;   // thread index Y-Dir

	int dX = gridDim.x * blockDim.x;	// stride in X-Dir, i.e. the picture's width
	int dY = gridDim.y * blockDim.y;	// stride in Y-Dir, i.e. the picture's height

	int blockIdxA = blockIdx.x * blockDim.x + blockIdx.y * blockDim.y * dX;	 // block offset to begin read

	// block offset to store data: the mirror of the last element of block A.
	// Note (blockDim.y - 1): with blockDim.y the offset lands one row too early
	// and becomes negative for the last block row.
	int blockIdxB = dX * dY - blockIdxA - (blockDim.y - 1) * dX - blockDim.x;

	sBlockA[ty][tx] = pictureIn[blockIdxA + ty * dX + tx];  // linear data fetch from global memory

	__syncthreads();   // wait until the whole block has been loaded

	sBlockB[ty][tx] = sBlockA[15-ty][15-tx]; // mirror each element in the cache

	__syncthreads();   // wait for all threads to reach this point

	result[blockIdxB + ty * dX + tx] = sBlockB[ty][tx];   // linear data store in global memory
}
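
To check a kernel like this on the host, a plain C++ reference can help. The following is only a sketch under assumptions: the names flip_reference, mirrored_tile_offset and flip_tiled are made up for illustration, the image is assumed to be tightly packed (no pitch padding), and its dimensions are assumed to be multiples of the tile size. It derives the destination offset of a tile from the mirror of the tile's last element and compares a tile-wise flip against a simple whole-buffer reversal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference flip: pixel (x, y) moves to (width-1-x, height-1-y).
// In a tightly packed linear buffer this is a plain reversal.
std::vector<unsigned char> flip_reference(const std::vector<unsigned char>& in,
                                          std::size_t width, std::size_t height) {
    assert(in.size() == width * height);
    std::vector<unsigned char> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[in.size() - 1 - i] = in[i];
    }
    return out;
}

// Offset of the destination tile for a source tile that starts at srcOffset.
// The first element of the destination tile is the mirror of the LAST
// element of the source tile.
std::size_t mirrored_tile_offset(std::size_t srcOffset, std::size_t width,
                                 std::size_t height, std::size_t tileW,
                                 std::size_t tileH) {
    std::size_t srcLast = srcOffset + (tileH - 1) * width + (tileW - 1);
    return width * height - 1 - srcLast;
}

// Tile-wise flip that mimics the block structure of the kernel on the CPU.
std::vector<unsigned char> flip_tiled(const std::vector<unsigned char>& in,
                                      std::size_t width, std::size_t height,
                                      std::size_t tile) {
    assert(tile > 0 && width % tile == 0 && height % tile == 0);
    assert(in.size() == width * height);
    std::vector<unsigned char> out(in.size());
    for (std::size_t by = 0; by < height / tile; ++by) {
        for (std::size_t bx = 0; bx < width / tile; ++bx) {
            std::size_t src = by * tile * width + bx * tile;
            std::size_t dst = mirrored_tile_offset(src, width, height, tile, tile);
            for (std::size_t ty = 0; ty < tile; ++ty) {
                for (std::size_t tx = 0; tx < tile; ++tx) {
                    // Write linearly, read the mirrored element of the tile.
                    out[dst + ty * width + tx] =
                        in[src + (tile - 1 - ty) * width + (tile - 1 - tx)];
                }
            }
        }
    }
    return out;
}
```

Comparing the downloaded GPU result against flip_reference on a small test image is usually enough to catch off-by-one-row mistakes in the block offsets.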


Fetching and storing characters is lower bandwidth than other types, even when coalesced.

If possible, align the image to a 64-byte boundary, and pad the image so each row is aligned to 64 bytes.

Then, if you cast the pointers to int* and perform the exact same algorithm (with 1/4 the width, of course), I think you will see a big improvement.

Note the memory MUST be 4-byte aligned for casting to int* to be valid, otherwise you will get garbage. And it must be 64-byte aligned to get coalescing on compute capability 1.1 devices.
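
The alignment and padding rules above boil down to a little arithmetic. The snippet below is a minimal host-side sketch, not part of any CUDA API: is_aligned and padded_pitch are hypothetical helper names, and both assume the alignment is a power of two. The CUDA runtime's cudaMallocPitch chooses its own pitch on the device; this only illustrates the calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if pointer p sits on a multiple of `alignment` bytes.
// Assumption: alignment is a non-zero power of two.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Row width in bytes rounded up to the next multiple of `alignment`,
// so that every row starts on an aligned address if the base is aligned.
std::size_t padded_pitch(std::size_t widthBytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (widthBytes + alignment - 1) & ~(alignment - 1);
}
```

For example, a 1000-byte-wide row padded to 64 bytes gets a pitch of 1024 bytes, while a row that is already 1024 bytes wide is left unchanged.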

Thanks for your help, but there is one thing I have a problem with.

Currently I allocate the memory with cudaMallocPitch(), so the memory alignment should be fine.

If I fetch an integer from global memory and store it in a 16x16 shared memory block of type int,

I have to reverse the bytes within each integer in shared memory, and I don't know a good solution for this :( such as a

bit-shift operation or something similar.


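You can cast the int to a char4 type and swap the outer and the inner pair of bytes through a temporary, something like this:

char4 * charType = (char4 *) & intPixelData;

char tmp = charType->x;
charType->x = charType->w;
charType->w = tmp;

tmp = charType->y;
charType->y = charType->z;
charType->z = tmp;

The tmp variable (or an XOR swap) is needed because assigning the four fields one after another would overwrite x and y before they are read, but this is the general idea.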

Actually, I would use a char4 type struct for all your calculations, instead of an int, if you’re actually storing byte/char-size data anyway.
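
For the bit-shift variant asked about above, the byte order of a 32-bit word can also be reversed with shifts and masks. This is a small sketch in plain C++ (the same expressions should also work in device code); reverse_bytes and flip_packed are hypothetical names, and the second function assumes the row holds a whole number of 4-byte words and that input and output do not overlap. Flipping a row of packed pixels then means reversing the order of the words and the bytes inside each word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reverse the four bytes of a 32-bit word: 0xAABBCCDD -> 0xDDCCBBAA.
std::uint32_t reverse_bytes(std::uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

// Flip a row of `count` packed words: the word order is reversed and
// the bytes inside each word are reversed as well.
// Assumption: `in` and `out` are distinct, non-overlapping buffers.
void flip_packed(const std::uint32_t* in, std::uint32_t* out, std::size_t count) {
    assert(in != nullptr && out != nullptr && in != out);
    for (std::size_t i = 0; i < count; ++i) {
        out[count - 1 - i] = reverse_bytes(in[i]);
    }
}
```

Whether this is faster than the char4 swap depends on the hardware and compiler, so it is worth measuring both.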

Thanks all for your help. It works fine now and is 15-30x faster than the CPU =)

You can remove the second shared memory array “sBlockB”: when you write back to “result”, just read from “sBlockA” using “sBlockA[15-ty][15-tx]”. This saves a read and a write to shared memory and one __syncthreads().