I wrote a kernel to flip an image. My idea is to read a block into shared memory, flip it there,
and store the block to the equivalent (mirrored) place in the output image.
When I benchmark this kernel against a 2 GHz CPU, with
blockDim (16,16) and gridDim (64,48),
I only get a 5x speedup on a GTX 260 and a 1.5x speedup on a 9800 GT.
Perhaps this kernel is bound by memory speed, or I am doing something wrong.
Do you have any hints for me?
__global__ void kernel (unsigned char* pictureIn, unsigned char* result)
{
    __shared__ unsigned char sBlockA[16][16]; // 1st shared-memory tile: the block as read
    __shared__ unsigned char sBlockB[16][16]; // 2nd shared-memory tile: the mirrored block
    int tx = threadIdx.x; // thread index in X
    int ty = threadIdx.y; // thread index in Y
    int dX = gridDim.x * blockDim.x; // row stride, i.e. the picture's width
    int dY = gridDim.y * blockDim.y; // number of rows, i.e. the picture's height
    int blockIdxA = blockIdx.x * blockDim.x + blockIdx.y * blockDim.y * dX; // offset of the block to read
    int blockIdxB = dX * dY - blockIdxA - (blockDim.y - 1) * dX - blockDim.x; // offset of the mirrored block to write, so that pixel i ends up at dX * dY - 1 - i
    sBlockA[ty][tx] = pictureIn[blockIdxA + ty * dX + tx]; // linear fetch from global memory
    __syncthreads(); // wait until the whole block is loaded
    sBlockB[ty][tx] = sBlockA[15-ty][15-tx]; // mirror each element in shared memory
    __syncthreads(); // wait until the whole block is mirrored
    result[blockIdxB + ty * dX + tx] = sBlockB[ty][tx]; // linear store to global memory
}
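One way to check whether the kernel is memory bound is to compare its effective bandwidth with that of a plain copy that moves the same number of bytes. The following is only a minimal sketch: copyKernel, ImageKernel and effectiveBandwidthGBs are names made up for this illustration, error checking is omitted, and the image size is assumed to be exactly gridDim * blockDim bytes, as in the kernel above.

```cuda
#include <cuda_runtime.h>

// Baseline (hypothetical): plain byte copy, one byte per thread.
__global__ void copyKernel(unsigned char* in, unsigned char* out)
{
    int dX = gridDim.x * blockDim.x;                 // image width in bytes
    int x  = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int y  = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    out[y * dX + x] = in[y * dX + x];
}

// Any kernel with the same signature as the flip kernel can be timed.
typedef void (*ImageKernel)(unsigned char*, unsigned char*);

// Times `runs` launches with CUDA events and returns the effective
// bandwidth in GB/s, counting one read and one write per byte.
float effectiveBandwidthGBs(ImageKernel k, dim3 grid, dim3 block, int runs)
{
    size_t bytes = (size_t)grid.x * block.x * grid.y * block.y;
    unsigned char *dIn = 0, *dOut = 0;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemset(dIn, 1, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<grid, block>>>(dIn, dOut);                   // warm-up launch, not timed
    cudaEventRecord(start, 0);
    for (int i = 0; i < runs; ++i)
        k<<<grid, block>>>(dIn, dOut);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                      // wait for all timed launches

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return 2.0f * (float)bytes * (float)runs / (ms * 1.0e6f);
}
```

Calling effectiveBandwidthGBs(copyKernel, dim3(64, 48), dim3(16, 16), 100) and then the same with the flip kernel in place of copyKernel gives two numbers to compare. If they are close, the flip is probably limited by global memory traffic rather than by the shared-memory shuffling.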
Fetching and storing single characters gives lower bandwidth than wider types, even when the accesses are coalesced.
If possible, align the image to a 64-byte boundary, and pad the image so that each row is also aligned to 64 bytes.
Then, if you cast the pointers to int* and run the same algorithm (with 1/4 the width, of course, and with the four bytes inside each int reversed as well), I think you will see a big improvement.
Note that the memory MUST be 4-byte aligned for the cast to int* to be valid; otherwise you will get garbage. And it must be 64-byte aligned to get coalescing on compute capability 1.1 devices.
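The int* idea could look roughly like the sketch below. It is not a tuned implementation. Assumptions: the row width is a multiple of 64 bytes, blockDim is (16,16), each thread handles one 32-bit word (four pixels), and unsigned int is used instead of int so the shifts are well defined. Since each word holds four pixels, the bytes inside the word have to be reversed too. flipKernelWords, reverseBytes and checkFlipWords are illustrative names. Memory from cudaMalloc is aligned to at least 256 bytes, so the alignment requirement is met here.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reverse the four bytes of a 32-bit word (the four pixels it holds).
__device__ unsigned int reverseBytes(unsigned int v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Same block scheme as the byte kernel, but on 32-bit words.
__global__ void flipKernelWords(const unsigned int* in, unsigned int* out)
{
    __shared__ unsigned int sBlock[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;
    int dX = gridDim.x * blockDim.x;   // row width in 32-bit words
    int dY = gridDim.y * blockDim.y;   // number of rows
    int srcBlock = blockIdx.x * blockDim.x + blockIdx.y * blockDim.y * dX;
    // Word w must land at dX * dY - 1 - w, hence the (blockDim.y - 1) term.
    int dstBlock = dX * dY - srcBlock - (blockDim.y - 1) * dX - blockDim.x;
    sBlock[ty][tx] = in[srcBlock + ty * dX + tx];
    __syncthreads();
    out[dstBlock + ty * dX + tx] = reverseBytes(sBlock[15 - ty][15 - tx]);
}

// Host-side check: true if every output byte i equals input byte n - 1 - i.
bool checkFlipWords(int gridX, int gridY)
{
    int n = gridX * 16 * 4 * gridY * 16;             // image size in bytes
    std::vector<unsigned char> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (unsigned char)(i % 251);

    unsigned int *dIn = 0, *dOut = 0;
    cudaMalloc((void**)&dIn, n);
    cudaMalloc((void**)&dOut, n);
    cudaMemcpy(dIn, in.data(), n, cudaMemcpyHostToDevice);
    flipKernelWords<<<dim3(gridX, gridY), dim3(16, 16)>>>(dIn, dOut);
    cudaMemcpy(out.data(), dOut, n, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    for (int i = 0; i < n; ++i)
        if (out[i] != in[n - 1 - i]) return false;
    return true;
}
```

For the 1024 x 768 image from the question, the grid would be (16, 48) instead of (64, 48), because each block now covers 64 pixels per row.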
You can remove the second shared memory array “sBlockB”: when you write back to “result”, just read from “sBlockA” using “sBlockA[15-ty][15-tx]”. This saves you one read and one write to shared memory and one __syncthreads().
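With that change the kernel might look like the following sketch. It assumes blockDim (16,16) and an image of exactly gridDim * blockDim bytes. flipKernelSingleBuffer and checkFlip are illustrative names. The write offset uses (blockDim.y - 1) * dX so that pixel i lands at dX * dY - 1 - i and the last block starts at offset 0.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Flip with a single shared-memory tile: mirror while writing back.
__global__ void flipKernelSingleBuffer(const unsigned char* pictureIn,
                                       unsigned char* result)
{
    __shared__ unsigned char sBlock[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;
    int dX = gridDim.x * blockDim.x;   // picture width
    int dY = gridDim.y * blockDim.y;   // picture height
    int blockIdxA = blockIdx.x * blockDim.x + blockIdx.y * blockDim.y * dX;
    int blockIdxB = dX * dY - blockIdxA - (blockDim.y - 1) * dX - blockDim.x;
    sBlock[ty][tx] = pictureIn[blockIdxA + ty * dX + tx];   // coalesced read
    __syncthreads();                   // the only barrier that is still needed
    result[blockIdxB + ty * dX + tx] = sBlock[15 - ty][15 - tx];
}

// Host-side check: true if every output byte i equals input byte n - 1 - i.
bool checkFlip(int gridX, int gridY)
{
    int n = gridX * 16 * gridY * 16;                 // image size in bytes
    std::vector<unsigned char> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (unsigned char)(i % 251);

    unsigned char *dIn = 0, *dOut = 0;
    cudaMalloc((void**)&dIn, n);
    cudaMalloc((void**)&dOut, n);
    cudaMemcpy(dIn, in.data(), n, cudaMemcpyHostToDevice);
    flipKernelSingleBuffer<<<dim3(gridX, gridY), dim3(16, 16)>>>(dIn, dOut);
    cudaMemcpy(out.data(), dOut, n, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    for (int i = 0; i < n; ++i)
        if (out[i] != in[n - 1 - i]) return false;
    return true;
}
```

checkFlip(64, 48) runs the kernel on the 1024 x 768 case from the question and compares the result with a reference computed on the host.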