Do these two coalesced global memory access patterns have the same performance in theory?

Hi all, I have two GPU kernels with the same code structure that differ only in their global-memory-to-shared-memory access pattern. The first kernel’s access pattern is

 *(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + ((((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + ((((int)threadIdx.x) & 15) * 16384)) + (k_0_0 * 32)) + ((((int)threadIdx.x) >> 4) * 16)));

the access list of the first warp:

tid 0, access A: [0][0] ~ [0][16] 
tid 1, access A: [1][0] ~ [1][16] 
tid 2, access A: [2][0] ~ [2][16] 
tid 3, access A: [3][0] ~ [3][16] 
tid 4, access A: [4][0] ~ [4][16] 
tid 5, access A: [5][0] ~ [5][16] 
tid 6, access A: [6][0] ~ [6][16] 
tid 7, access A: [7][0] ~ [7][16] 
tid 8, access A: [8][0] ~ [8][16] 
tid 9, access A: [9][0] ~ [9][16] 
tid 10, access A: [10][0] ~ [10][16] 
tid 11, access A: [11][0] ~ [11][16] 
tid 12, access A: [12][0] ~ [12][16] 
tid 13, access A: [13][0] ~ [13][16] 
tid 14, access A: [14][0] ~ [14][16] 
tid 15, access A: [15][0] ~ [15][16] 
tid 16, access A: [0][16] ~ [0][32] 
tid 17, access A: [1][16] ~ [1][32] 
tid 18, access A: [2][16] ~ [2][32] 
tid 19, access A: [3][16] ~ [3][32] 
tid 20, access A: [4][16] ~ [4][32] 
tid 21, access A: [5][16] ~ [5][32] 
tid 22, access A: [6][16] ~ [6][32] 
tid 23, access A: [7][16] ~ [7][32] 
tid 24, access A: [8][16] ~ [8][32] 
tid 25, access A: [9][16] ~ [9][32] 
tid 26, access A: [10][16] ~ [10][32] 
tid 27, access A: [11][16] ~ [11][32] 
tid 28, access A: [12][16] ~ [12][32] 
tid 29, access A: [13][16] ~ [13][32] 
tid 30, access A: [14][16] ~ [14][32] 
tid 31, access A: [15][16] ~ [15][32] 

And the second kernel’s access pattern is

*(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + (((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + (kk_0 * 512)) + (((int)threadIdx.x) * 16)));

the access list of the first warp:

Problem Size : M 16384 N 16384 K 16384
tid 0, access A: [0][0] ~ [0][16] 
tid 1, access A: [0][16] ~ [0][32] 
tid 2, access A: [0][32] ~ [0][48] 
tid 3, access A: [0][48] ~ [0][64] 
tid 4, access A: [0][64] ~ [0][80] 
tid 5, access A: [0][80] ~ [0][96] 
tid 6, access A: [0][96] ~ [0][112] 
tid 7, access A: [0][112] ~ [0][128] 
tid 8, access A: [0][128] ~ [0][144] 
tid 9, access A: [0][144] ~ [0][160] 
tid 10, access A: [0][160] ~ [0][176] 
tid 11, access A: [0][176] ~ [0][192] 
tid 12, access A: [0][192] ~ [0][208] 
tid 13, access A: [0][208] ~ [0][224] 
tid 14, access A: [0][224] ~ [0][240] 
tid 15, access A: [0][240] ~ [0][256] 
tid 16, access A: [0][256] ~ [0][272] 
tid 17, access A: [0][272] ~ [0][288] 
tid 18, access A: [0][288] ~ [0][304] 
tid 19, access A: [0][304] ~ [0][320] 
tid 20, access A: [0][320] ~ [0][336] 
tid 21, access A: [0][336] ~ [0][352] 
tid 22, access A: [0][352] ~ [0][368] 
tid 23, access A: [0][368] ~ [0][384] 
tid 24, access A: [0][384] ~ [0][400] 
tid 25, access A: [0][400] ~ [0][416] 
tid 26, access A: [0][416] ~ [0][432] 
tid 27, access A: [0][432] ~ [0][448] 
tid 28, access A: [0][448] ~ [0][464] 
tid 29, access A: [0][464] ~ [0][480] 
tid 30, access A: [0][480] ~ [0][496] 
tid 31, access A: [0][496] ~ [0][512]

Assume array A is a 16384x16384 matrix. From my understanding, as long as the threads in a warp access at least 32 contiguous bytes, they can leverage the DRAM burst and fill the bandwidth. Yet the first access pattern is much less efficient and incurs a large overhead. Can anybody explain this?

The code was tested on a single 24GB 3090 GPU, with Ubuntu 20.04 and CUDA 11.1 installed.

I haven’t worked through your indexing math to verify that the access patterns you present are indeed accurate, so I will base this on the assumption that your stated indexing does imply your stated access pattern. I do have some reason to believe, just based on the indexing, that the first pattern will be bad, due to the `(((int)threadIdx.x) & 15) * 16384` term.

That is generally a bad formula. To arrange for coalesced loads/stores, we prefer not to multiply threadIdx.x by anything; it’s best if threadIdx.x appears as an additive term. But that is certainly not a complete analysis of your indexing, so let’s move on to your stated access patterns.

I’m not sure what “leverage the DRAM burst” means exactly, but I doubt it is a sensible way to reason about coalescing. Coalescing never refers to a single thread; it always considers the behavior of adjacent threads. With that in mind, let’s also note that you are loading an int4 (16 bytes) per thread, so 512 bytes when considered warp-wide. When the request across the warp is for more than 128 bytes, the memory controller will always break it up into multiple transactions, a single transaction being no more than 128 bytes. So this request will be decomposed into 4 transactions, each serving 1/4 of the warp. Therefore we must look at the behavior of threads 0-7, 8-15, 16-23, and 24-31 separately. Let’s start with threads 0-7. Considering just those threads, the indicated accesses for the first case are A[0][0..15] through A[7][0..15], one row per thread.

Do those represent adjacent memory regions? They do not (given that your array is 16384 x 16384). So that is a lousy and inefficient load pattern: it will require further decomposition into 8 separate loads, as presented to memory.

Looking at the second access pattern, threads 0-7 access A[0][0..15] through A[0][112..127]. That does represent a completely adjacent load, all along a single “row”. So that will coalesce nicely, and the memory controller can do it with a single load from DRAM.


Thanks for your comment, this really helps me a lot, @Robert_Crovella.
