Hi all, I now have two GPU kernels with the same code structure that differ only in their global-memory-to-shared-memory access pattern. The first kernel’s access pattern is
*(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + ((((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + ((((int)threadIdx.x) & 15) * 16384)) + (k_0_0 * 32)) + ((((int)threadIdx.x) >> 4) * 16)));
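To make the generated index arithmetic easier to follow, here is my own decode of the flat offset into (row, column) of A. The variable names are mine, and I’m assuming a 1-byte element type (so one int4 load covers 16 elements) with a row stride of K = 16384:

__device__ const int4 *pattern1_src(const char *A, int k_0_0)
{
    // lanes 0..15 of a warp land on 16 different rows of A
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16 + (threadIdx.x & 15);
    // lanes 16..31 repeat the same 16 rows, shifted 16 columns to the right
    int col = k_0_0 * 32 + (threadIdx.x >> 4) * 16;
    return (const int4 *)(A + (size_t)row * 16384 + col);
}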
the access list of the first warp:
tid 0, access A: [0][0] ~ [0][16]
tid 1, access A: [1][0] ~ [1][16]
tid 2, access A: [2][0] ~ [2][16]
tid 3, access A: [3][0] ~ [3][16]
tid 4, access A: [4][0] ~ [4][16]
tid 5, access A: [5][0] ~ [5][16]
tid 6, access A: [6][0] ~ [6][16]
tid 7, access A: [7][0] ~ [7][16]
tid 8, access A: [8][0] ~ [8][16]
tid 9, access A: [9][0] ~ [9][16]
tid 10, access A: [10][0] ~ [10][16]
tid 11, access A: [11][0] ~ [11][16]
tid 12, access A: [12][0] ~ [12][16]
tid 13, access A: [13][0] ~ [13][16]
tid 14, access A: [14][0] ~ [14][16]
tid 15, access A: [15][0] ~ [15][16]
tid 16, access A: [0][16] ~ [0][32]
tid 17, access A: [1][16] ~ [1][32]
tid 18, access A: [2][16] ~ [2][32]
tid 19, access A: [3][16] ~ [3][32]
tid 20, access A: [4][16] ~ [4][32]
tid 21, access A: [5][16] ~ [5][32]
tid 22, access A: [6][16] ~ [6][32]
tid 23, access A: [7][16] ~ [7][32]
tid 24, access A: [8][16] ~ [8][32]
tid 25, access A: [9][16] ~ [9][32]
tid 26, access A: [10][16] ~ [10][32]
tid 27, access A: [11][16] ~ [11][32]
tid 28, access A: [12][16] ~ [12][32]
tid 29, access A: [13][16] ~ [13][32]
tid 30, access A: [14][16] ~ [14][32]
tid 31, access A: [15][16] ~ [15][32]
And the second kernel’s access pattern is
*(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + (((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + (kk_0 * 512)) + (((int)threadIdx.x) * 16)));
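Decoded the same way (again my own names and the same 1-byte-element assumption), the second kernel’s load looks like:

__device__ const int4 *pattern2_src(const char *A, int kk_0)
{
    // all 32 lanes of a warp stay on the same row of A
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16;
    // and sweep 512 consecutive columns (32 lanes x 16 bytes)
    int col = kk_0 * 512 + threadIdx.x * 16;
    return (const int4 *)(A + (size_t)row * 16384 + col);
}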
the access list of the first warp (problem size: M = 16384, N = 16384, K = 16384):
tid 0, access A: [0][0] ~ [0][16]
tid 1, access A: [0][16] ~ [0][32]
tid 2, access A: [0][32] ~ [0][48]
tid 3, access A: [0][48] ~ [0][64]
tid 4, access A: [0][64] ~ [0][80]
tid 5, access A: [0][80] ~ [0][96]
tid 6, access A: [0][96] ~ [0][112]
tid 7, access A: [0][112] ~ [0][128]
tid 8, access A: [0][128] ~ [0][144]
tid 9, access A: [0][144] ~ [0][160]
tid 10, access A: [0][160] ~ [0][176]
tid 11, access A: [0][176] ~ [0][192]
tid 12, access A: [0][192] ~ [0][208]
tid 13, access A: [0][208] ~ [0][224]
tid 14, access A: [0][224] ~ [0][240]
tid 15, access A: [0][240] ~ [0][256]
tid 16, access A: [0][256] ~ [0][272]
tid 17, access A: [0][272] ~ [0][288]
tid 18, access A: [0][288] ~ [0][304]
tid 19, access A: [0][304] ~ [0][320]
tid 20, access A: [0][320] ~ [0][336]
tid 21, access A: [0][336] ~ [0][352]
tid 22, access A: [0][352] ~ [0][368]
tid 23, access A: [0][368] ~ [0][384]
tid 24, access A: [0][384] ~ [0][400]
tid 25, access A: [0][400] ~ [0][416]
tid 26, access A: [0][416] ~ [0][432]
tid 27, access A: [0][432] ~ [0][448]
tid 28, access A: [0][448] ~ [0][464]
tid 29, access A: [0][464] ~ [0][480]
tid 30, access A: [0][480] ~ [0][496]
tid 31, access A: [0][496] ~ [0][512]
Assume array A is a 16384x16384 matrix. From my understanding, if the threads in a warp coalesce to at least 32 contiguous bytes, they can leverage the DRAM burst and fill the bandwidth, but the first access pattern turns out to be much less efficient and incurs a large overhead. Can anybody explain why?
The code was tested on a single 24 GB RTX 3090 GPU, with Ubuntu 20.04 and CUDA 11.1 installed.
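In case it helps, below is a minimal standalone sketch (not the original generated kernels, just my own simplification) that isolates the two global-to-shared copy patterns so they can be profiled, e.g. with Nsight Compute. The kernel names copy_pattern1/copy_pattern2 are mine, the element type is assumed to be 1 byte, k_0_0/kk_0 are fixed to 0, and the block shape (32, 2, 2) is what the index coefficients above imply:

#include <cstdio>
#include <cuda_runtime.h>

#define K 16384

__global__ void copy_pattern1(const char *A, char *out)
{
    __shared__ __align__(16) char A_shared[2048];
    // lanes 0..15 read 16 different rows; lanes 16..31 the same rows, 16 columns later
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16 + (threadIdx.x & 15);
    int col = (threadIdx.x >> 4) * 16;                        // k_0_0 fixed to 0 here
    *(int4 *)(A_shared + threadIdx.y * 1024 + threadIdx.z * 512 + threadIdx.x * 16)
        = *(const int4 *)(A + (size_t)row * K + col);
    __syncthreads();
    // write one byte back out so the copy is not optimized away
    int tid = threadIdx.y * 64 + threadIdx.z * 32 + threadIdx.x;
    out[blockIdx.y * 128 + tid] = A_shared[tid * 16];
}

__global__ void copy_pattern2(const char *A, char *out)
{
    __shared__ __align__(16) char A_shared[2048];
    // all 32 lanes of a warp stay on one row and read 512 consecutive bytes
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16;
    int col = threadIdx.x * 16;                               // kk_0 fixed to 0 here
    *(int4 *)(A_shared + threadIdx.y * 1024 + threadIdx.z * 512 + threadIdx.x * 16)
        = *(const int4 *)(A + (size_t)row * K + col);
    __syncthreads();
    int tid = threadIdx.y * 64 + threadIdx.z * 32 + threadIdx.x;
    out[blockIdx.y * 128 + tid] = A_shared[tid * 16];
}

int main()
{
    char *A, *out;
    cudaMalloc(&A, (size_t)K * K);      // same footprint as a 16384x16384 1-byte matrix
    cudaMemset(A, 1, (size_t)K * K);
    cudaMalloc(&out, 256 * 128);        // one byte per thread, 256 blocks of 128 threads
    dim3 block(32, 2, 2);               // block shape implied by the index coefficients
    dim3 grid(1, K / 64);               // each block covers 64 rows of A
    copy_pattern1<<<grid, block>>>(A, out);
    copy_pattern2<<<grid, block>>>(A, out);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}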