Misaligned Data Access Has No Effect on Performance?

Hi all,
We are trying to improve the performance of some CUDA code. One promising avenue is to correct "misaligned data accesses". However, when we tried the offsetCopy function from the CUDA C Best Practices Guide https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#effects-of-misaligned-accesses, we noticed that misaligned memory access has no effect on performance. Please see the following code and the attached screenshot. Our platform is a Jetson TX2.
The memory access efficiency really does drop to 50% when the offset is 1, but this seems to do no harm to performance: the total kernel duration does not become longer. Why?


#include <cstdio>
#include <helper_cuda.h>  // checkCudaErrors, from the CUDA samples

__global__ void offsetCopy(char *odata, char *idata, int offset) {
	// Each thread copies a single byte, starting at the given element offset.
	int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
	odata[xid] = idata[xid];
}

int main(int argc, char **argv) {
	// some initialization code

	const int MemorySize = 60 * 1024 * 1024;
	const int max_offset = 17;
	char *d_idata[max_offset], *d_odata[max_offset];

	printf("Allocating memory...\n");
	for (int i = 0; i < max_offset; i++) {
		checkCudaErrors(cudaMalloc((void **)&d_idata[i], MemorySize * sizeof(char)));
		checkCudaErrors(cudaMalloc((void **)&d_odata[i], MemorySize * sizeof(char)));
	}

	// warm-up launch
	offsetCopy<<<1024, 32>>>(d_odata[max_offset - 1], d_idata[max_offset - 1], 0);

	// one launch per offset; kernel durations are compared in the profiler
	for (int i = 0; i < max_offset; i++)
		offsetCopy<<<1024 * 10, 32>>>(d_odata[i], d_idata[i], i);

	for (int i = 0; i < max_offset; i++) {
		checkCudaErrors(cudaFree(d_idata[i]));
		checkCudaErrors(cudaFree(d_odata[i]));
	}

	return 0;
}

Copying one byte per thread is not efficient, regardless of alignment. As far as I know, this does not meet the memory coalescing rules for any CUDA architecture. You would have to switch to 32-bit integer reads, so that an entire warp reads or writes 32*sizeof(int) bytes.
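For illustration, here is a minimal sketch of the same copy done 4 bytes per thread (hypothetical kernel name; it assumes the buffer size is a multiple of sizeof(int) and that offset is given in int elements rather than bytes):

__global__ void offsetCopyInt(int *odata, const int *idata, int offset) {
	// Sketch: one 4-byte load and one 4-byte store per thread, so a full
	// warp moves 32*sizeof(int) = 128 bytes per access.
	int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
	odata[xid] = idata[xid];
}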

Integer access always requires reads and writes to be located at memory addresses that are multiples of 4 bytes, with ideal coalescing happening when the entire warp sequentially accesses data beginning at an address that is a multiple of 32*sizeof(int) = 128 bytes.

Since the Fermi GPU generation, sequential int reads that do not start at a multiple of 128 bytes generate two memory transactions instead of one, which is still quite efficient. For example, with an offset of one int, a warp's 32 consecutive 4-byte reads cover bytes 4 through 131, which straddle two 128-byte segments, so both segments must be fetched. On ancient GPUs predating Fermi, such a misaligned warp access would have prevented coalescing altogether.

If you need to read/write at arbitrary byte offsets, consider using warp shuffle or shared memory in combination with the funnel shifter to produce any required byte alignment. Your reads/writes would always follow the alignment requirements for integer types, but you would shift the data around in registers to achieve the intended byte offset.
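As a sketch of the register-shifting idea (hypothetical kernel; assumes a little-endian byte layout, a byte offset in 0..3, and that idata has at least one extra int of padding past the last word read):

__global__ void byteShiftedCopy(unsigned int *odata, const unsigned int *idata,
                                int byteOffset /* 0..3 */, int nWords) {
	// Read at an arbitrary byte offset using only aligned int loads, then
	// realign the bytes in registers with the funnel shifter (sm_32+).
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < nWords) {
		unsigned int lo = idata[i];      // aligned 4-byte load
		unsigned int hi = idata[i + 1];  // aligned load of the next word
		// Little-endian: bytes (4*i + byteOffset .. +3) are the low 32 bits
		// of (hi:lo) shifted right by 8*byteOffset bits.
		odata[i] = __funnelshift_r(lo, hi, 8 * byteOffset);
	}
}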

It's possible to use textures for unaligned reads (even of individual bytes) with fairly good throughput. For writes, however, you still have to follow the coalescing rules, and these require the use of int types (or the int2/int4 vector types).
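A sketch of the texture route, using the texture object API to bind the existing linear allocation for byte fetches (hypothetical host-side setup; writes in the kernel would still use aligned int stores as described above):

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_idata;                       // existing allocation
resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned char>();
resDesc.res.linear.sizeInBytes = MemorySize;

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
checkCudaErrors(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));

// Inside a kernel, a byte at an arbitrary index can then be fetched with:
//   unsigned char b = tex1Dfetch<unsigned char>(tex, xid);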

Because the L1 cache on modern GPUs fixes the misaligned access issue for this particular case. You can see the drop in "efficiency" by profiling your code with a metric such as gld_efficiency, but the drop to 50% in this case does not have much effect on performance, because the L1 cache "fixes" the issue, at least for bulk data transfer.
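For example, on a Jetson TX2 the load and store efficiency metrics can be read with nvprof (assuming your binary is named offsetCopy):

nvprof --metrics gld_efficiency,gst_efficiency ./offsetCopy

With your kernel you should see gld_efficiency drop to around 50% for odd offsets, even while the kernel duration stays essentially flat.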

On recent GPUs, 1 byte per thread can still "coalesce" properly: the memory controller will identify that the addresses requested by the threads of a warp belong to the same line/segment (assuming the byte addresses are adjacent), and "coalesce" the requested data into a single transaction, or a minimum number of transactions.

Furthermore, on recent GPUs the 128-byte expectation for a global load is "relaxed" at the L1 cache, meaning that a "partial" requirement of only 32 bytes will result in only a 32-byte transaction being issued from L1 to L2 (and thereafter possibly to DRAM, if it misses in L2). This particular behavior varies by GPU architecture, but on Maxwell and beyond it can be observed with careful test case construction and the profiler.

Having said that, it will generally still be possible to achieve higher main-memory bandwidth utilization by requesting larger amounts of data per global load instruction, such as 32 bits per thread, or even up to 128 bits per thread, the CUDA maximum for a single transaction/load instruction.
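A minimal sketch of a 128-bit-per-thread copy (hypothetical kernel; assumes the byte count is a multiple of 16, which also restricts it to 16-byte-aligned offsets):

__global__ void copyInt4(int4 *odata, const int4 *idata, int nVec) {
	// 16 bytes per thread via int4 loads/stores; this typically compiles to
	// single 128-bit load/store instructions. cudaMalloc returns pointers
	// aligned to well over 16 bytes, so the base addresses qualify.
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < nVec)
		odata[i] = idata[i];
}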

Thanks txbob, I was not taking the caches into account. Some of my knowledge seems a bit outdated, then.

But these caches would have no accelerating effect on writes; is that assumption correct?

I think that is correct. L1 is a write-through cache on all architectures AFAIK (feel free to correct me if you know otherwise). Therefore the L1 will not “fix” the misalignment for writes. However, AFAIK, L2 is a write-back cache on all architectures, therefore the transaction efficiency would probably be “restored” (assuming reasonable temporal locality in the code) for any transactions emanating from L2 to DRAM, for bulk data transfer on writes.

I use the term "bulk" here to refer to data transfer associated with a contiguous logical entity, such as a "vector" (or matrix). Scattered access that "straddles" line/segment boundaries will still incur the noticeably worse efficiency reflected in the 50% number (as opposed to scattered access that does not straddle line/segment boundaries).

Hi txbob,
Thank you. One more thing: what does "modern GPU" mean? GPUs above sm_20? sm_35? I ask because I have noticed that many basic GPU performance-optimization techniques have become unnecessary on modern GPUs.

Any GPU with an L1 cache is "modern" for that statement. The only GPU generation without one was cc1.x (the first generation).