Dear CUDA community,
I implemented these two kernels for swapping the red and green channels of an image, in the first one the memory access is coalesced, in the second one it’s not:
__global__ void gpu_swapRG_coalesced(uint8* raster, const uint32 npixels) {
int i = 3 * blockIdx.x * blockDim.x + threadIdx.x;
if(i < npixels*3) {
__shared__ uint8 s_data[BLOCKDIM * 3];
s_data[threadIdx.x] = *(raster + i);
s_data[threadIdx.x + BLOCKDIM] = *(raster + i + BLOCKDIM);
s_data[threadIdx.x + 2*BLOCKDIM] = *(raster + i + 2*BLOCKDIM);
__syncthreads(); // because the red and green threads are used simultaneously
uint8 aux;
aux = s_data[threadIdx.x * 3 + 1]; // aux = green channel
s_data[threadIdx.x * 3 + 1] = s_data[threadIdx.x * 3]; // green channel = red channel
s_data[threadIdx.x * 3] = aux; // red channel = old green channel
__syncthreads(); // threads could be copying pixels that are half or not swapped.
*(raster + i) = s_data[threadIdx.x];
*(raster + i + BLOCKDIM) = s_data[threadIdx.x + BLOCKDIM];
*(raster + i + 2*BLOCKDIM) = s_data[threadIdx.x + 2*BLOCKDIM];
}
}
Non coalesced memory access version:
__global__ void gpu_swapRG_not_coalesced(uint8* raster, const uint32 npixels) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i < npixels) {
uint8 aux;
aux = *(raster + i * 3 + 1);
*(raster + i * 3 + 1) = *(raster + i * 3);
*(raster + i * 3) = aux;
}
}
The non coalesced version is much faster than the coalesced one. I removed the
__syncthreads();
to check whether this was the problem, but there was no performance difference. Later I realized that the three writes at the end of the coalesced function is what’s taking so long.
Does anyone have an explanation to this?
Cristobal