Hi all, I now have two GPU kernels with the same code structure that differ only in their global-memory-to-shared-memory access pattern. The first kernel’s access pattern is
*(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + ((((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + ((((int)threadIdx.x) & 15) * 16384)) + (k_0_0 * 32)) + ((((int)threadIdx.x) >> 4) * 16)));
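To make the generated index arithmetic easier to follow, here is my own decode of the flat offset into (row, column) of A. The variable names are mine, and I’m assuming a 1-byte element type (so one int4 load covers 16 elements) with a row stride of K = 16384:

__device__ const int4 *pattern1_src(const char *A, int k_0_0)
{
    // lanes 0..15 of a warp land on 16 different rows of A
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16 + (threadIdx.x & 15);
    // lanes 16..31 repeat the same 16 rows, shifted 16 columns to the right
    int col = k_0_0 * 32 + (threadIdx.x >> 4) * 16;
    return (const int4 *)(A + (size_t)row * 16384 + col);
}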
the access list of the first warp:
tid 0, access A: [0][0] ~ [0][16]
tid 1, access A: [1][0] ~ [1][16]
tid 2, access A: [2][0] ~ [2][16]
tid 3, access A: [3][0] ~ [3][16]
tid 4, access A: [4][0] ~ [4][16]
tid 5, access A: [5][0] ~ [5][16]
tid 6, access A: [6][0] ~ [6][16]
tid 7, access A: [7][0] ~ [7][16]
tid 8, access A: [8][0] ~ [8][16]
tid 9, access A: [9][0] ~ [9][16]
tid 10, access A: [10][0] ~ [10][16]
tid 11, access A: [11][0] ~ [11][16]
tid 12, access A: [12][0] ~ [12][16]
tid 13, access A: [13][0] ~ [13][16]
tid 14, access A: [14][0] ~ [14][16]
tid 15, access A: [15][0] ~ [15][16]
tid 16, access A: [0][16] ~ [0][32]
tid 17, access A: [1][16] ~ [1][32]
tid 18, access A: [2][16] ~ [2][32]
tid 19, access A: [3][16] ~ [3][32]
tid 20, access A: [4][16] ~ [4][32]
tid 21, access A: [5][16] ~ [5][32]
tid 22, access A: [6][16] ~ [6][32]
tid 23, access A: [7][16] ~ [7][32]
tid 24, access A: [8][16] ~ [8][32]
tid 25, access A: [9][16] ~ [9][32]
tid 26, access A: [10][16] ~ [10][32]
tid 27, access A: [11][16] ~ [11][32]
tid 28, access A: [12][16] ~ [12][32]
tid 29, access A: [13][16] ~ [13][32]
tid 30, access A: [14][16] ~ [14][32]
tid 31, access A: [15][16] ~ [15][32]
And the second kernel’s access pattern is
*(int4 *)(A_shared + (((((int)threadIdx.y) * 1024) + (((int)threadIdx.z) * 512)) + (((int)threadIdx.x) * 16))) = *(int4 *)(A + (((((((int)blockIdx.y) * 1048576) + (((int)threadIdx.y) * 524288)) + (((int)threadIdx.z) * 262144)) + (kk_0 * 512)) + (((int)threadIdx.x) * 16)));
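Decoded the same way (again my own names and the same 1-byte-element assumption), the second kernel’s load looks like:

__device__ const int4 *pattern2_src(const char *A, int kk_0)
{
    // all 32 lanes of a warp stay on the same row of A
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16;
    // and sweep 512 consecutive columns (32 lanes x 16 bytes)
    int col = kk_0 * 512 + threadIdx.x * 16;
    return (const int4 *)(A + (size_t)row * 16384 + col);
}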
the access list of the first warp (problem size: M = 16384, N = 16384, K = 16384):
tid 0, access A: [0][0] ~ [0][16]
tid 1, access A: [0][16] ~ [0][32]
tid 2, access A: [0][32] ~ [0][48]
tid 3, access A: [0][48] ~ [0][64]
tid 4, access A: [0][64] ~ [0][80]
tid 5, access A: [0][80] ~ [0][96]
tid 6, access A: [0][96] ~ [0][112]
tid 7, access A: [0][112] ~ [0][128]
tid 8, access A: [0][128] ~ [0][144]
tid 9, access A: [0][144] ~ [0][160]
tid 10, access A: [0][160] ~ [0][176]
tid 11, access A: [0][176] ~ [0][192]
tid 12, access A: [0][192] ~ [0][208]
tid 13, access A: [0][208] ~ [0][224]
tid 14, access A: [0][224] ~ [0][240]
tid 15, access A: [0][240] ~ [0][256]
tid 16, access A: [0][256] ~ [0][272]
tid 17, access A: [0][272] ~ [0][288]
tid 18, access A: [0][288] ~ [0][304]
tid 19, access A: [0][304] ~ [0][320]
tid 20, access A: [0][320] ~ [0][336]
tid 21, access A: [0][336] ~ [0][352]
tid 22, access A: [0][352] ~ [0][368]
tid 23, access A: [0][368] ~ [0][384]
tid 24, access A: [0][384] ~ [0][400]
tid 25, access A: [0][400] ~ [0][416]
tid 26, access A: [0][416] ~ [0][432]
tid 27, access A: [0][432] ~ [0][448]
tid 28, access A: [0][448] ~ [0][464]
tid 29, access A: [0][464] ~ [0][480]
tid 30, access A: [0][480] ~ [0][496]
tid 31, access A: [0][496] ~ [0][512]
Assume array A is a 16384x16384 matrix. From my understanding, if the threads in a warp coalesce to at least 32 contiguous bytes, they can leverage the DRAM burst and fill the bandwidth, but the first access pattern turns out to be much less efficient and incurs a large overhead. Can anybody explain why?
The code was tested on a single 24 GB RTX 3090 GPU, with Ubuntu 20.04 and CUDA 11.1 installed.
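In case it helps, below is a minimal standalone sketch (not the original generated kernels, just my own simplification) that isolates the two global-to-shared copy patterns so they can be profiled, e.g. with Nsight Compute. The kernel names copy_pattern1/copy_pattern2 are mine, the element type is assumed to be 1 byte, k_0_0/kk_0 are fixed to 0, and the block shape (32, 2, 2) is what the index coefficients above imply:

#include <cstdio>
#include <cuda_runtime.h>

#define K 16384

__global__ void copy_pattern1(const char *A, char *out)
{
    __shared__ __align__(16) char A_shared[2048];
    // lanes 0..15 read 16 different rows; lanes 16..31 the same rows, 16 columns later
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16 + (threadIdx.x & 15);
    int col = (threadIdx.x >> 4) * 16;                        // k_0_0 fixed to 0 here
    *(int4 *)(A_shared + threadIdx.y * 1024 + threadIdx.z * 512 + threadIdx.x * 16)
        = *(const int4 *)(A + (size_t)row * K + col);
    __syncthreads();
    // write one byte back out so the copy is not optimized away
    int tid = threadIdx.y * 64 + threadIdx.z * 32 + threadIdx.x;
    out[blockIdx.y * 128 + tid] = A_shared[tid * 16];
}

__global__ void copy_pattern2(const char *A, char *out)
{
    __shared__ __align__(16) char A_shared[2048];
    // all 32 lanes of a warp stay on one row and read 512 consecutive bytes
    int row = blockIdx.y * 64 + threadIdx.y * 32 + threadIdx.z * 16;
    int col = threadIdx.x * 16;                               // kk_0 fixed to 0 here
    *(int4 *)(A_shared + threadIdx.y * 1024 + threadIdx.z * 512 + threadIdx.x * 16)
        = *(const int4 *)(A + (size_t)row * K + col);
    __syncthreads();
    int tid = threadIdx.y * 64 + threadIdx.z * 32 + threadIdx.x;
    out[blockIdx.y * 128 + tid] = A_shared[tid * 16];
}

int main()
{
    char *A, *out;
    cudaMalloc(&A, (size_t)K * K);      // same footprint as a 16384x16384 1-byte matrix
    cudaMemset(A, 1, (size_t)K * K);
    cudaMalloc(&out, 256 * 128);        // one byte per thread, 256 blocks of 128 threads
    dim3 block(32, 2, 2);               // block shape implied by the index coefficients
    dim3 grid(1, K / 64);               // each block covers 64 rows of A
    copy_pattern1<<<grid, block>>>(A, out);
    copy_pattern2<<<grid, block>>>(A, out);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}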