Coalesced access slower than non-coalesced

Dear CUDA community,

I implemented these two kernels for swapping the red and green channels of an image. In the first one the memory access is coalesced; in the second one it is not:

__global__ void gpu_swapRG_coalesced(uint8* raster, const uint32 npixels) {

	int i = 3 * blockIdx.x * blockDim.x + threadIdx.x;

	if(i < npixels*3) {
		__shared__ uint8 s_data[BLOCKDIM * 3];

		s_data[threadIdx.x] = *(raster + i);
		s_data[threadIdx.x + BLOCKDIM] = *(raster + i + BLOCKDIM);
		s_data[threadIdx.x + 2*BLOCKDIM] = *(raster + i + 2*BLOCKDIM);

		__syncthreads(); // because the red and green bytes are used simultaneously

		uint8 aux;
		aux = s_data[threadIdx.x * 3 + 1]; // aux = green channel
		s_data[threadIdx.x * 3 + 1] = s_data[threadIdx.x * 3]; // green channel = red channel
		s_data[threadIdx.x * 3] = aux; // red channel = old green channel

		__syncthreads(); // otherwise threads could copy back pixels that are only half swapped

		*(raster + i) = s_data[threadIdx.x];
		*(raster + i + BLOCKDIM) = s_data[threadIdx.x + BLOCKDIM];
		*(raster + i + 2*BLOCKDIM) = s_data[threadIdx.x + 2*BLOCKDIM];
	}
}



Non-coalesced memory access version:

__global__ void gpu_swapRG_not_coalesced(uint8* raster, const uint32 npixels) {

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < npixels) {
		uint8 aux;
		aux = *(raster + i * 3 + 1);
		*(raster + i * 3 + 1) = *(raster + i * 3);
		*(raster + i * 3) = aux;
	}
}
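For context, this is roughly how I time the two kernels. This is a sketch: the grid size calculation, `d_raster`, and the `uint8`/`uint32` typedefs are assumed from the surrounding code, not shown here.

```cuda
// Hypothetical timing harness; BLOCKDIM, d_raster and npixels are assumed
// to be set up elsewhere as in the kernels above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// One thread per pixel; the coalesced kernel uses the same launch shape.
int nblocks = (npixels + BLOCKDIM - 1) / BLOCKDIM;

cudaEventRecord(start);
gpu_swapRG_coalesced<<<nblocks, BLOCKDIM>>>(d_raster, npixels);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed kernel time in milliseconds
```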



The non-coalesced version is much faster than the coalesced one. I removed the __syncthreads() calls to check whether they were the problem, but there was no performance difference. Later I realized that the three writes at the end of the coalesced kernel are what takes so long.

Does anyone have an explanation for this?


Probably shared memory bank conflicts. The limitations of shared memory access for 8- and 16-bit types are discussed in Section G.3.3 of the current programming guide. Incidentally, which card are you running this code on?

Could also be because the second kernel does only 4 (though wider) memory transactions, whereas the first one does 6.
I'd guess the first kernel would only be faster on devices of compute capability 1.0 or 1.1.
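To illustrate the "fewer, wider transactions" point: if the raster could be repacked to 4 bytes per pixel (e.g. RGBA), each thread could swap R and G inside a single naturally aligned 32-bit word, so every load and store is one wide coalesced transaction. This is only a sketch under that padding assumption; the byte order (R in byte 0, G in byte 1) is also my assumption, not something from the original code.

```cuda
// Sketch: assumes the image has been repacked as one 32-bit word per pixel
// (e.g. RGBA), with R in the low byte and G in the next byte.
__global__ void gpu_swapRG_rgba(uint32* raster, const uint32 npixels) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < npixels) {
		uint32 p = raster[i];                  // one coalesced 32-bit load
		uint32 r = p & 0x000000FFu;            // byte 0: red (assumed layout)
		uint32 g = (p >> 8) & 0x000000FFu;     // byte 1: green (assumed layout)
		p = (p & 0xFFFF0000u) | (r << 8) | g;  // swap the two low bytes
		raster[i] = p;                         // one coalesced 32-bit store
	}
}
```

The cost of this approach is 33% more memory traffic and a repacking pass, so it only pays off if the image already lives in a 4-byte-per-pixel format.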

I think neither of your kernels is coalesced. To get coalesced access you should read from an aligned memory pointer.
Consider sm_13 or later: the first read in the first kernel actually takes 2 transactions per half-warp, and since you have 3 reads and 3 writes, you get 12 transactions per half-warp in total.
In the second case a half-warp is serviced in 2 transactions per memory access, so you get 4*2 = 8 in total.

As avidday mentioned earlier, on top of that you have bank conflicts, but they shouldn't matter since you are memory-bound here.
Also, I don't think calling __syncthreads() inside a conditional branch that depends on threadIdx (rather than blockIdx) is a valid thing to do.
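For reference, one way to fix that is to let every thread in the block reach the barriers and guard only the memory accesses. A rough sketch of the first kernel restructured along those lines (same BLOCKDIM and uint8/uint32 assumptions as above; for simplicity it assumes whole blocks are either fully in range or fully out, i.e. npixels*3 is a multiple of BLOCKDIM*3):

```cuda
// Sketch: __syncthreads() is hoisted out of the divergent branch so all
// threads of the block reach each barrier.
__global__ void gpu_swapRG_coalesced_safe(uint8* raster, const uint32 npixels) {
	__shared__ uint8 s_data[BLOCKDIM * 3];
	int i = 3 * blockIdx.x * blockDim.x + threadIdx.x;
	// Guard the accesses, not the barrier. Assumes no partially filled blocks.
	bool in_range = (i + 2 * BLOCKDIM < npixels * 3);

	if (in_range) {
		s_data[threadIdx.x] = raster[i];
		s_data[threadIdx.x + BLOCKDIM] = raster[i + BLOCKDIM];
		s_data[threadIdx.x + 2 * BLOCKDIM] = raster[i + 2 * BLOCKDIM];
	}
	__syncthreads(); // every thread in the block reaches this

	if (in_range) {
		uint8 aux = s_data[threadIdx.x * 3 + 1];
		s_data[threadIdx.x * 3 + 1] = s_data[threadIdx.x * 3];
		s_data[threadIdx.x * 3] = aux;
	}
	__syncthreads();

	if (in_range) {
		raster[i] = s_data[threadIdx.x];
		raster[i + BLOCKDIM] = s_data[threadIdx.x + BLOCKDIM];
		raster[i + 2 * BLOCKDIM] = s_data[threadIdx.x + 2 * BLOCKDIM];
	}
}
```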

how large is your image?

Do you reach 100% utilization? (use all SMs)