Memory access should be coalesced but is not

Hello,

Thank you in advance for your support.

I have been working with CUDA and came across some behaviour that I did not expect regarding memory coalescing.

In short, my CUDA kernel receives two parameters: (1) an array of unsigned long long int and (2) a char sequence.
Each thread must process a contiguous segment of 32 chars, offset by 1 char per thread. That is, thread 0 processes chars 0-31, thread 1 chars 1-32, etc.

The thing is, I am expecting coalesced accesses, since on each iteration the chars accessed by consecutive threads are consecutive in memory, across the 32 threads of the warp.

My kernel (at the moment the code does not make much sense, because I have reduced its complexity to try to pinpoint the problem):

__global__ void kernel_index_global_any_simplest(unsigned long long int * table, const char * sequence) {

	unsigned long long int k, hash = 0;
	char c;

	// each thread sums 32 consecutive chars starting at its global thread index
	for (k = 0; k < 32; k++) {

		c = sequence[threadIdx.x + k + blockIdx.x * blockDim.x];

		hash += c;

	}

	table[threadIdx.x + blockIdx.x * blockDim.x] = hash;
}

However, when I run nvprof -f --analysis-metrics --export-profile, load the result into the NVIDIA Visual Profiler (version 10) and run the global memory access pattern analysis, it reports that 2 transactions are being performed instead of 1. As far as I know, a transaction can serve up to 4 bytes per thread x 32 threads = 128 bytes; here I am only requesting 32 contiguous bytes per warp, which I expected to be a single transaction, since it fits within one 128-byte line.

I am also posting the disassembly:

_Z32kernel_index_global_any_simplestPyPKc:
		MOV R1, c[0x0][0x20] ;
		S2R R5, SR_CTAID.X ;
		S2R R0, SR_TID.X ;
		XMAD R2, R5.reuse, c[0x0] [0x8], RZ ;
		XMAD.MRG R3, R5.reuse, c[0x0] [0x8].H1, RZ ;
		XMAD.PSL.CBCC R5, R5.H1, R3.H1, R2 ;
		IADD3 R2.CC, R0, c[0x0][0x148], R5 ;
		IADD3.X R3, RZ, c[0x0][0x14c], RZ ;
		LDG.E.S8 R17, [R2] ;
		LDG.E.S8 R18, [R2+0x1] ;
		LDG.E.S8 R20, [R2+0x2] ;
		LDG.E.S8 R19, [R2+0x3] ;
		LDG.E.S8 R21, [R2+0x4] ;
		LDG.E.S8 R23, [R2+0x5] ;
		LDG.E.S8 R24, [R2+0x6] ;
		LDG.E.S8 R25, [R2+0x7] ;
		LDG.E.S8 R26, [R2+0x8] ;
		LDG.E.S8 R4, [R2+0x9] ;
		LDG.E.S8 R6, [R2+0xa] ;
		LDG.E.S8 R7, [R2+0xb] ;
		LDG.E.S8 R8, [R2+0xc] ;
		LDG.E.S8 R9, [R2+0xd] ;
		LDG.E.S8 R10, [R2+0xe] ;
		LDG.E.S8 R11, [R2+0xf] ;
		LDG.E.S8 R12, [R2+0x10] ;
		LDG.E.S8 R13, [R2+0x11] ;
		LDG.E.S8 R14, [R2+0x12] ;
		LDG.E.S8 R15, [R2+0x13] ;
		LDG.E.S8 R16, [R2+0x14] ;
		DEPBAR.LE SB5, 0x9 ;
		SHR R22, R17, 0x1f ;
		{         IADD3 R28.CC, R20.reuse, R18, R17 ;
		LDG.E.S8 R17, [R2+0x15]         }
		{         SHR R27, R18, 0x1f ;
		LDG.E.S8 R18, [R2+0x16]         }
		SHR R20, R20, 0x1f ;
		{         IADD3.X R27, R20, R27, R22 ;
		LDG.E.S8 R20, [R2+0x18]         }
		DEPBAR.LE SB5, 0x9 ;
		IADD3 R29.CC, R21, R19, R28 ;
		{         SHR R22, R19, 0x1f ;
		LDG.E.S8 R19, [R2+0x17]         }
		SHR R21, R21, 0x1f ;
		{         IADD3.X R30, R21, R22, R27 ;
		LDG.E.S8 R21, [R2+0x19]         }
		DEPBAR.LE SB5, 0x9 ;
		{         IADD3 R27.CC, R24, R23, R29 ;
		LDG.E.S8 R22, [R2+0x1a]         }
		SHR R23, R23, 0x1f ;
		SHR R24, R24, 0x1f ;
		{         IADD3.X R28, R24, R23, R30 ;
		LDG.E.S8 R23, [R2+0x1b]         }
		LDG.E.S8 R24, [R2+0x1c] ;
		DEPBAR.LE SB5, 0xa ;
		{         IADD3 R29.CC, R26, R25, R27 ;
		LDG.E.S8 R27, [R2+0x1e]         }
		SHR R25, R25, 0x1f ;
		SHR R26, R26, 0x1f ;
		{         IADD3.X R28, R26, R25, R28 ;
		LDG.E.S8 R26, [R2+0x1d]         }
		LDG.E.S8 R25, [R2+0x1f] ;
		DEPBAR.LE SB5, 0xb ;
		IADD3 R30.CC, R6, R4, R29 ;
		SHR R29, R4, 0x1f ;
		SHR R6, R6, 0x1f ;
		IADD3.X R6, R6, R29, R28 ;
		DEPBAR.LE SB5, 0xa ;
		IADD3 R4.CC, R8.reuse, R7, R30 ;
		SHR R7, R7, 0x1f ;
		SHR R8, R8, 0x1f ;
		IADD3.X R7, R8, R7, R6 ;
		DEPBAR.LE SB5, 0x9 ;
		IADD3 R6.CC, R10.reuse, R9, R4 ;
		SHR R9, R9, 0x1f ;
		SHR R10, R10, 0x1f ;
		IADD3.X R9, R10, R9, R7 ;
		DEPBAR.LE SB5, 0x8 ;
		IADD3 R3.CC, R12.reuse, R11, R6 ;
		SHR R11, R11, 0x1f ;
		SHR R12, R12, 0x1f ;
		IADD3.X R11, R12, R11, R9 ;
		DEPBAR.LE SB5, 0x7 ;
		IADD3 R2.CC, R14, R13, R3 ;
		SHR R13, R13, 0x1f ;
		SHR R14, R14, 0x1f ;
		IADD3.X R13, R14, R13, R11 ;
		DEPBAR.LE SB5, 0x6 ;
		IADD3 R3.CC, R16, R15, R2 ;
		SHR R15, R15, 0x1f ;
		SHR R16, R16, 0x1f ;
		IADD3.X R15, R16, R15, R13 ;
		IADD R5, R5, R0 ;
		SHR.U32 R0, R5, 0x1d ;
		DEPBAR.LE SB5, 0x5 ;
		IADD3 R2.CC, R18, R17, R3 ;
		SHR R17, R17, 0x1f ;
		SHR R18, R18, 0x1f ;
		IADD3.X R17, R18, R17, R15 ;
		DEPBAR.LE SB5, 0x4 ;
		IADD3 R3.CC, R20, R19, R2 ;
		SHR R19, R19, 0x1f ;
		SHR R20, R20, 0x1f ;
		IADD3.X R19, R20, R19, R17 ;
		DEPBAR.LE SB5, 0x3 ;
		IADD3 R2.CC, R22.reuse, R21, R3 ;
		SHR R21, R21, 0x1f ;
		SHR R22, R22, 0x1f ;
		IADD3.X R21, R22, R21, R19 ;
		DEPBAR.LE SB5, 0x2 ;
		IADD3 R8.CC, R24.reuse, R23, R2 ;
		SHR R23, R23, 0x1f ;
		SHR R24, R24, 0x1f ;
		IADD3.X R24, R24, R23, R21 ;
		ISCADD R4.CC, R5, c[0x0][0x140], 0x3 ;
		IADD.X R5, R0, c[0x0][0x144] ;
		DEPBAR.LE SB5, 0x1 ;
		IADD3 R2.CC, R27.reuse, R26, R8 ;
		SHR R26, R26, 0x1f ;
		SHR R27, R27, 0x1f ;
		IADD3.X R26, R27, R26, R24 ;
		IADD R2.CC, R25.reuse, R2 ;
		SHR R25, R25, 0x1f ;
		IADD.X R3, R25, R26 ;
		STG.E.64 [R4], R2 ;
		EXIT ;
.L_1:
		BRA `(.L_1) ;
.L_164:

In the disassembly you can see that it first performs 21 of the 32 loads from global memory. I have noticed that by the time those 21 loads have been issued, the register pressure is nearly at its limit. Could it be that the compiler tried to keep all 32 chars in separate registers (per thread), filled up the available registers before reaching 32 loads, and therefore had to split what I expected to be a single coalesced load sequence into two separate batches, so that register pressure could be lowered (by computing part of the sum in between)?

If so, why does the compiler try to load as many chars as possible instead of interleaving the arithmetic operations in between? Wouldn't that result in fully coalesced accesses?

(Other: I am running this with 256 threads per block)

Thank you very much for your help

The compiler aggressively unrolls loops and then reorders loads towards the beginning of your kernel. This is an attempt to optimize the latency-hiding (i.e. performance) characteristics of the machine, since a load operation by definition cannot result in a warp stall. This has no bearing on whether or not a particular load instruction/pattern across a warp will coalesce, or how that load request will map into transactions.
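
If you want to see what the rolled loop compiles to, one option (just a sketch of a hypothetical kernel_index_rolled variant, not an optimization suggestion, since it changes neither the warp-wide access pattern nor how the requests coalesce) is to suppress the unrolling with an unroll pragma; the SASS should then show a single LDG per loop iteration:

__global__ void kernel_index_rolled(unsigned long long int * table, const char * sequence) {

	unsigned long long int k, hash = 0;

	#pragma unroll 1	// ask the compiler not to unroll this loop
	for (k = 0; k < 32; k++) {
		hash += sequence[threadIdx.x + k + blockIdx.x * blockDim.x];
	}

	table[threadIdx.x + blockIdx.x * blockDim.x] = hash;
}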

Byte strings of arbitrary alignment may result in more than one transaction per load operation. Memory segments and cache lines do not have arbitrary alignment; their alignment is fixed in the address space. If the requested data spans one of these line/segment boundaries, a particular load request may result in more than one transaction, regardless of how many bytes are needed for the request when viewed warp-wide.

Speaking of global loads as viewed up to the point of the L1 cache, the way to achieve 1 transaction per request is to make sure the data you request across the warp for a load fits within a single 128-byte aligned region. (This gets a bit more clouded on Maxwell and newer devices, which allow global load activity at granularities smaller than 128 bytes at the L1.) Obviously, a walking pattern such as yours cannot satisfy this for every request.
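
To make the arithmetic concrete, here is a small host-side sketch (the helper name segmentsTouched is hypothetical, and it assumes 128-byte segments, one byte loaded per thread, and sequence starting at a 128-byte aligned address, as a cudaMalloc allocation would) that counts how many segments one warp-wide request of the kernel above touches at each iteration k, for the warp covering threads 96-127 of block 0:

#include <cstdio>

// Number of 128-byte segments touched by a warp that loads the 32 bytes
// firstByte .. firstByte+31 (one byte per thread).
static int segmentsTouched(unsigned long long firstByte)
{
	unsigned long long firstSeg = firstByte / 128;        // segment of the lowest address
	unsigned long long lastSeg  = (firstByte + 31) / 128; // segment of the highest address
	return (int)(lastSeg - firstSeg + 1);
}

int main()
{
	// Warp covering threads 96..127 of block 0: at iteration k it reads
	// bytes (96 + k) .. (127 + k). Warps starting at byte offsets 0, 32 or 64
	// within a segment never straddle here, since k <= 31.
	for (unsigned long long k = 0; k < 32; k++)
		printf("k=%2llu -> %d transaction(s)\n", k, segmentsTouched(96 + k));
	return 0;
}

This prints 1 transaction for k = 0 and 2 transactions for k = 1 through 31: under these assumptions, only the warps whose first byte falls 96 bytes into a 128-byte segment (threads 96-127, 224-255, and so on) ever straddle a boundary, and those warps do so on 31 of their 32 iterations.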

Having said all this, I wouldn’t worry too much about it from a performance perspective. The L1 cache is there to mitigate just such access patterns.

Thank you Robert.

I would be very thankful if you could help me clarify this further question:

  • Assuming 128-byte aligned regions, and supposing the compiler had not unrolled the loop and reordered the loads (i.e. it stayed as a for loop with one load per iteration): would that mean that the iterations whose bytes fall entirely within one aligned region (e.g. within bytes 0-127) would be coalesced into a single transaction, but not the iterations that straddle a boundary (e.g. bytes 120-151, since that span is not contained in a single aligned region)? So only the loads that fit within an aligned region would coalesce into one transaction, giving a sort of alternating coalescence? (Note: I am not trying to force coalescence, just trying to figure it out.)

Thank you very much!

Coalesced means grouped.

When you have a global load instruction, the GPU will observe the addresses requested by each thread in the warp, and it will group or “coalesce” adjacent addresses together, because these addresses that are grouped or coalesced together can be satisfied by the same transaction to memory.

In that respect, coalescing occurs in both cases. However when the addresses span across a memory segment or cacheline boundary, the coalescing process will produce two or more groups, whereas if all the addresses fall within the same cacheline or memory segment, the coalescing process will produce just a single group, which can be satisfied by a single transaction.
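
For contrast, the canonical fully coalesced case (a sketch with hypothetical names, unrelated to your kernel) is one where each thread of the warp reads its own 4-byte element, so the 32 addresses cover exactly one 128-byte aligned segment, assuming the pointer comes from cudaMalloc and the warp's first element index is a multiple of 32:

// Each warp reads 32 consecutive ints = 128 bytes from one aligned segment,
// so the coalescing process produces a single group per request.
__global__ void copy_int(int * out, const int * in) {
	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
	out[i] = in[i];
}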

Thank you very much Robert. It is now clear to me.

Sorry to reopen this, but I have one more question:

You said that the L1 cache would mitigate the effect of such access patterns. However, as I understand it, on the Maxwell architecture the L1 cache does not cache global loads by default. I also found this:

https://stackoverflow.com/questions/28895133/what-is-l1-cache-used-for-in-nvidias-maxwell-gpus

Isn’t it then the L2 cache that caches the global loads? Am I missing something?

Thank you for your time

Correct, if the L1 cache is not enabled for a particular architecture (or as a result of a compiler switch, or even specific PTX coding choices) then it won’t provide any benefit here. In that case, the L2, which cannot be disabled, may perform the same function.
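
As a sketch of one of those coding choices (assuming a device of compute capability 3.5 or newer, and with a hypothetical kernel name): the byte loads could be routed through the read-only data cache, which on Maxwell is the unified texture/L1 cache, using the __ldg() intrinsic on a pointer that the compiler knows is read-only:

__global__ void kernel_index_ldg(unsigned long long int * table, const char * __restrict__ sequence) {

	unsigned long long int k, hash = 0;

	for (k = 0; k < 32; k++) {
		// __ldg routes the load through the read-only data cache
		hash += __ldg(&sequence[threadIdx.x + k + blockIdx.x * blockDim.x]);
	}

	table[threadIdx.x + blockIdx.x * blockDim.x] = hash;
}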