Misaligned Data Access Has No Effect on Performance?

Hi all,
We are trying to improve the performance of some CUDA code. One promising avenue is to correct "misaligned data accesses". However, when we tried the offsetCopy function from the CUDA C Best Practices Guide https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#effects-of-misaligned-accesses, we noticed that misaligned memory access has no effect on performance. Please see the following code and the attached screenshot. Our platform is a Jetson TX2.
The memory access efficiency really does drop to 50% when the offset is 1, but this seems to do no harm to performance: the total kernel duration does not become longer. Why?


#include <cstdio>
#include <helper_cuda.h>  // checkCudaErrors, from the CUDA samples

__global__ void offsetCopy(char *odata, char *idata, int offset) {
	// Each thread copies a single byte, starting at the given element offset.
	int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
	odata[xid] = idata[xid];
}

int main(int argc, char **argv) {
	// some initialization code

	const int MemorySize = 60 * 1024 * 1024;
	const int max_offset = 17;
	char *d_idata[max_offset], *d_odata[max_offset];

	printf("Allocating memory...\n");
	for (int i = 0; i < max_offset; i++) {
		checkCudaErrors(cudaMalloc((void **)&d_idata[i], MemorySize * sizeof(char)));
		checkCudaErrors(cudaMalloc((void **)&d_odata[i], MemorySize * sizeof(char)));
	}

	// warm-up launch
	offsetCopy<<<1024, 32>>>(d_odata[max_offset - 1], d_idata[max_offset - 1], 0);

	// one launch per offset; kernel durations are compared in the profiler
	for (int i = 0; i < max_offset; i++)
		offsetCopy<<<1024 * 10, 32>>>(d_odata[i], d_idata[i], i);

	for (int i = 0; i < max_offset; i++) {
		checkCudaErrors(cudaFree(d_idata[i]));
		checkCudaErrors(cudaFree(d_odata[i]));
	}

	return 0;
}

Copying one byte per thread is not efficient, regardless of alignment. As far as I know, this does not meet the memory coalescing rules for any CUDA architecture. You would have to switch to 32-bit integer reads, so that an entire warp reads or writes 32*sizeof(int) bytes.
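For illustration, here is a minimal sketch of the same copy done 4 bytes per thread (hypothetical kernel name; it assumes the buffer size is a multiple of sizeof(int) and that offset is given in int elements rather than bytes):

__global__ void offsetCopyInt(int *odata, const int *idata, int offset) {
	// Sketch: one 4-byte load and one 4-byte store per thread, so a full
	// warp moves 32*sizeof(int) = 128 bytes per access.
	int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
	odata[xid] = idata[xid];
}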

Integer access always requires reads and writes to be located at memory addresses that are multiples of 4 bytes, with ideal coalescing happening when the entire warp sequentially accesses data beginning at an address that is a multiple of 32*sizeof(int) = 128 bytes.

Since the Fermi GPU generation, sequential int reads that do not start at a multiple of 128 bytes generate two memory transactions instead of one, which is still quite efficient. For example, with an offset of one int, a warp's 32 consecutive 4-byte reads cover bytes 4 through 131, which straddle two 128-byte segments, so both segments must be fetched. On ancient GPUs predating Fermi, such a misaligned warp access would have prevented coalescing altogether.

If you need to read/write at arbitrary byte offsets, consider using warp shuffle or shared memory in combination with the funnel shifter to produce any required byte alignment. Your reads/writes would always follow the alignment requirements for integer types, but you would shift the data around in registers to achieve the intended byte offset.
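As a sketch of the register-shifting idea (hypothetical kernel; assumes a little-endian byte layout, a byte offset in 0..3, and that idata has at least one extra int of padding past the last word read):

__global__ void byteShiftedCopy(unsigned int *odata, const unsigned int *idata,
                                int byteOffset /* 0..3 */, int nWords) {
	// Read at an arbitrary byte offset using only aligned int loads, then
	// realign the bytes in registers with the funnel shifter (sm_32+).
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < nWords) {
		unsigned int lo = idata[i];      // aligned 4-byte load
		unsigned int hi = idata[i + 1];  // aligned load of the next word
		// Little-endian: bytes (4*i + byteOffset .. +3) are the low 32 bits
		// of (hi:lo) shifted right by 8*byteOffset bits.
		odata[i] = __funnelshift_r(lo, hi, 8 * byteOffset);
	}
}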

It's possible to use textures for unaligned reads (even of individual bytes) with fairly good throughput. For writes, however, you still have to follow the coalescing rules, and these require the use of int types (or the int2/int4 vector types).
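A sketch of the texture route, using the texture object API to bind the existing linear allocation for byte fetches (hypothetical host-side setup; writes in the kernel would still use aligned int stores as described above):

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_idata;                       // existing allocation
resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned char>();
resDesc.res.linear.sizeInBytes = MemorySize;

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
checkCudaErrors(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));

// Inside a kernel, a byte at an arbitrary index can then be fetched with:
//   unsigned char b = tex1Dfetch<unsigned char>(tex, xid);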

Because the L1 cache on modern GPUs fixes the misaligned access issue for this particular case. You can see the drop in "efficiency" by profiling your code with a metric such as gld_efficiency, but the drop to 50% in this case does not have much effect on performance, because the L1 cache "fixes" the issue, at least for bulk data transfer.
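For example, on a Jetson TX2 the load and store efficiency metrics can be read with nvprof (assuming your binary is named offsetCopy):

nvprof --metrics gld_efficiency,gst_efficiency ./offsetCopy

With your kernel you should see gld_efficiency drop to around 50% for odd offsets, even while the kernel duration stays essentially flat.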

On recent GPUs, 1 byte per thread can still "coalesce" properly: the memory controller will identify that the addresses requested by the threads of a warp belong to the same line/segment (assuming the byte addresses are adjacent), and "coalesce" the requested data into a single transaction, or a minimum number of transactions.

Furthermore, on recent GPUs the 128-byte expectation for a global load is "relaxed" at the L1 cache, meaning that a "partial" requirement of only 32 bytes will result in only a 32-byte transaction being issued from L1 to L2 (and thereafter possibly to DRAM, if it misses in L2). This particular behavior varies by GPU architecture, but on Maxwell and beyond it can be observed with careful test case construction and the profiler.

Having said that, it will generally still be possible to achieve higher main-memory bandwidth utilization by requesting larger amounts of data per global load instruction, such as 32 bits per thread, or even up to 128 bits per thread, the CUDA maximum for a single transaction/load instruction.
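A minimal sketch of a 128-bit-per-thread copy (hypothetical kernel; assumes the byte count is a multiple of 16, which also restricts it to 16-byte-aligned offsets):

__global__ void copyInt4(int4 *odata, const int4 *idata, int nVec) {
	// 16 bytes per thread via int4 loads/stores; this typically compiles to
	// single 128-bit load/store instructions. cudaMalloc returns pointers
	// aligned to well over 16 bytes, so the base addresses qualify.
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < nVec)
		odata[i] = idata[i];
}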

Thanks txbob, I was not taking the caches into account. Some of my knowledge seems a bit outdated, then.

But these caches would have no accelerating effect on writes; is that assumption correct?

I think that is correct. L1 is a write-through cache on all architectures AFAIK (feel free to correct me if you know otherwise). Therefore the L1 will not “fix” the misalignment for writes. However, AFAIK, L2 is a write-back cache on all architectures, therefore the transaction efficiency would probably be “restored” (assuming reasonable temporal locality in the code) for any transactions emanating from L2 to DRAM, for bulk data transfer on writes.

I use the term "bulk" here to refer to data transfer associated with a contiguous logical entity, such as a "vector" (or matrix). Scattered access that "straddles" line/segment boundaries will still incur the noticeably worse efficiency reflected in the 50% number (as opposed to scattered access that does not straddle line/segment boundaries).

Hi txbob,
Thank you. One more thing: what does "modern GPU" mean? GPUs above sm_20? sm_35? I ask because I have noticed that many basic GPU performance-optimization techniques have become unnecessary on modern GPUs.

Any GPU with an L1 cache is "modern" for that statement. The only GPU generation without one was cc1.x (the first generation).