Coalesced access slower than non-coalesced

Dear CUDA community,

I implemented these two kernels for swapping the red and green channels of an image. In the first one the memory access is coalesced; in the second one it is not:

__global__ void gpu_swapRG_coalesced(uint8* raster, const uint32 npixels) {
	int i = 3 * blockIdx.x * blockDim.x + threadIdx.x;
	if(i < npixels*3) {
		__shared__ uint8 s_data[BLOCKDIM * 3];

		s_data[threadIdx.x] = *(raster + i);
		s_data[threadIdx.x + BLOCKDIM] = *(raster + i + BLOCKDIM);
		s_data[threadIdx.x + 2*BLOCKDIM] = *(raster + i + 2*BLOCKDIM);

		__syncthreads(); // because the red and green threads are used simultaneously

		uint8 aux;
		aux = s_data[threadIdx.x * 3 + 1]; // aux = green channel
		s_data[threadIdx.x * 3 + 1] = s_data[threadIdx.x * 3]; // green channel = red channel
		s_data[threadIdx.x * 3] = aux; // red channel = old green channel

		__syncthreads(); // threads could be copying pixels that are half or not swapped.

		*(raster + i) = s_data[threadIdx.x];
		*(raster + i + BLOCKDIM) = s_data[threadIdx.x + BLOCKDIM];
		*(raster + i + 2*BLOCKDIM) = s_data[threadIdx.x + 2*BLOCKDIM];
	}
}

Non-coalesced memory access version:

__global__ void gpu_swapRG_not_coalesced(uint8* raster, const uint32 npixels) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i < npixels) {
		uint8 aux;
		aux = *(raster + i * 3 + 1);
		*(raster + i * 3 + 1) = *(raster + i * 3);
		*(raster + i * 3) = aux;
	}
}

The non-coalesced version is much faster than the coalesced one. I removed the

__syncthreads();

calls to check whether they were the problem, but there was no performance difference. Later I realized that the three writes at the end of the coalesced kernel are what takes so long.

Does anyone have an explanation for this?

Cristobal

Probably shared memory bank conflicts. The limitations of shared memory access for 8- and 16-bit types are discussed in Section G.3.3 of the current programming guide. Incidentally, which card are you running this code on?
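For illustration, here is a toy sketch of the pattern that section describes (my own example, not taken from the kernels above; the kernel and array names are made up, and it assumes a 64-thread block for simplicity). On compute capability 1.x, the four bytes of a 32-bit shared memory word live in the same bank, so consecutive threads reading consecutive chars conflict:

__global__ void bank_conflict_demo(char* out) {
	__shared__ char s[64 * 4]; // assumes blockDim.x == 64

	// fill the array so both reads below touch initialized data
	for (int k = threadIdx.x; k < 64 * 4; k += blockDim.x)
		s[k] = (char)k;
	__syncthreads();

	// threads 0..3 of a half-warp read bytes of the same 32-bit word
	// -> same bank -> 4-way conflict on compute capability 1.x
	char conflicting = s[threadIdx.x];

	// stride-4 byte indexing puts consecutive threads in different banks,
	// at the cost of a 4x larger array
	char conflict_free = s[4 * threadIdx.x];

	out[threadIdx.x] = conflicting + conflict_free;
}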

It could also be because the second kernel performs only 4 (though wider) memory accesses per thread, whereas the first one performs 6.
I'd guess the first kernel would only be faster on devices of compute capability 1.0 or 1.1.

I think neither of your kernels is coalesced. To get coalesced access you should read from an aligned memory pointer.
On sm_13 or later, the first read in the first kernel actually takes 2 transactions for a half-warp, and since you have 3 reads and 3 writes, you get 12 transactions per half-warp in total.
In the second kernel a half-warp is serviced in 2 transactions per memory access, so you get 4*2 = 8 in total.

As avidday mentioned earlier, on top of that you have bank conflicts, but they shouldn't matter since you are memory-bound here.
Also, I don't think calling __syncthreads() inside a conditional branch that depends on threadIdx (rather than blockIdx) is a valid thing to do.
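To make those two points concrete, here is a rough, untested sketch (mine, not from this thread) of a variant that stages each block's bytes with aligned 32-bit loads and keeps __syncthreads() outside any thread-dependent branch. It assumes, as the original does, that the block size is BLOCKDIM, and additionally that BLOCKDIM is a multiple of 4, that raster comes from cudaMalloc and is therefore word-aligned, and that the buffer is padded to a multiple of 4 bytes:

__global__ void gpu_swapRG_word(uint8* raster, const uint32 npixels) {
	// Declare the staging buffer as words so it is 4-byte aligned.
	__shared__ uint32 s_words[(BLOCKDIM * 3) / 4];
	uint8* s_data = (uint8*)s_words;

	// Each block handles BLOCKDIM pixels = 3*BLOCKDIM bytes = 3*BLOCKDIM/4 words.
	const uint32 nwords     = (BLOCKDIM * 3) / 4;     // assumes BLOCKDIM % 4 == 0
	const uint32 wordBase   = blockIdx.x * nwords;
	const uint32 totalWords = (npixels * 3 + 3) / 4;  // assumes the buffer is padded to 4 bytes
	uint32* raster32 = (uint32*)raster;               // assumes raster is word-aligned

	// Aligned, coalesced 32-bit loads; only the first nwords threads load.
	if (threadIdx.x < nwords && wordBase + threadIdx.x < totalWords)
		s_words[threadIdx.x] = raster32[wordBase + threadIdx.x];

	__syncthreads(); // outside any thread-dependent branch

	// Byte-wise swap in shared memory, same as in the original kernel.
	const uint32 pixel = blockIdx.x * BLOCKDIM + threadIdx.x;
	if (pixel < npixels) {
		uint8 aux = s_data[threadIdx.x * 3 + 1];      // green
		s_data[threadIdx.x * 3 + 1] = s_data[threadIdx.x * 3];
		s_data[threadIdx.x * 3] = aux;
	}

	__syncthreads();

	// Aligned, coalesced 32-bit stores.
	if (threadIdx.x < nwords && wordBase + threadIdx.x < totalWords)
		raster32[wordBase + threadIdx.x] = s_words[threadIdx.x];
}

The byte-wise swap in shared memory is unchanged, so any bank conflicts there remain, but the global loads and stores become aligned word transactions.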

How large is your image?

Do you reach 100% utilization (i.e., do you use all the SMs)?
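A quick way to check is to print the grid size next to the multiprocessor count; you want many more blocks than SMs. Host-side sketch with placeholder names (d_raster and the launch shape are just examples, not from the thread):

#include <cstdio>

void launch_swap(uint8* d_raster, uint32 npixels) {
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	dim3 block(BLOCKDIM);
	dim3 grid((npixels + BLOCKDIM - 1) / BLOCKDIM);

	// With far fewer blocks than multiprocessors, some SMs sit idle
	// and neither kernel's timing tells you much about coalescing.
	printf("blocks = %u, multiprocessors = %d\n", grid.x, prop.multiProcessorCount);

	gpu_swapRG_not_coalesced<<<grid, block>>>(d_raster, npixels);
}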