Hi! I am learning the ping-pong (double-buffering) strategy. Something like this:
allocate a double-buffered shared memory array, such as A[2][64][64]
load the first block of data into A[0][?][?]
for ( .... )
prefetch the next block of data into A[1][?][?] <---------------(1)
get values from A[0][?][?] and calculate <------------(2)
So (1) and (2) can actually happen at the same time? They both touch shared memory at the same time, but at different locations, right?
Or maybe they access shared memory one after another, but since writing into shared memory may be slower, the tail of the write overlaps with the calculation part?
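To make my question concrete, here is a toy version of what I mean (I made up the kernel name, the dummy sum, and the assumption that in holds num_tiles tiles of 64x64 floats; it is only a sketch of the pattern, not my real kernel):

__global__ void pingpong_toy(const float *in, float *out, int num_tiles) {
    __shared__ float A[2][64][64];
    float acc = 0.f;

    // load the first tile into A[0]
    for (int i = threadIdx.x; i < 64 * 64; i += blockDim.x)
        A[0][i / 64][i % 64] = in[i];
    __syncthreads();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t % 2, nxt = 1 - cur;
        // (1) prefetch the next tile into the other buffer
        if (t + 1 < num_tiles)
            for (int i = threadIdx.x; i < 64 * 64; i += blockDim.x)
                A[nxt][i / 64][i % 64] = in[(t + 1) * 64 * 64 + i];
        // (2) compute on the current buffer (here just a dummy sum)
        for (int i = threadIdx.x; i < 64 * 64; i += blockDim.x)
            acc += A[cur][i / 64][i % 64];
        // next iteration reads what (1) wrote and overwrites what (2) read
        __syncthreads();
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}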
Whether or not (1) and (2) can happen at the same time depends on the GPU architecture and on how the prefetch is implemented. On Ampere, with memcpy_async, it should work; a rough sketch is below.
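For example, something along these lines (a minimal sketch, not tested; TILE, the kernel name, and the dummy reduction are placeholders I made up):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 1024; // elements per tile (illustrative)

__global__ void pingpong_async(const float *in, float *out, int num_tiles) {
    __shared__ float buf[2][TILE];
    auto block = cg::this_thread_block();
    float acc = 0.f;

    // stage the first tile
    cg::memcpy_async(block, buf[0], in, sizeof(float) * TILE);
    cg::wait(block);
    block.sync();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t % 2, nxt = 1 - cur;
        // (1) kick off the asynchronous copy of the next tile
        if (t + 1 < num_tiles)
            cg::memcpy_async(block, buf[nxt], in + (t + 1) * TILE,
                             sizeof(float) * TILE);
        // (2) compute on the current tile while the copy is in flight
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            acc += buf[cur][i];
        cg::wait(block);  // copies have landed
        block.sync();     // everyone is done reading before the buffers swap roles
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

The same source compiles for older architectures, but the copy is only hardware-accelerated (cp.async) on sm_80 and newer; on Turing it falls back to an ordinary copy staged through registers, so any overlap is just whatever the compiler manages to schedule.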
That sounds good! But I am using a Turing GTX 1650, and I hope this kernel can be used on different architectures… Do you know of any examples? I found some code that claims to achieve ping-pong, but I do not know how to test it. Would you help me check it?
for (int loop = 0; loop < k; loop += 8) {
    // calc
    #pragma unroll
    for (int subk = 0; subk < 8; ++subk) {
        if (7 == subk and loop < k - 8) {
            // if there is another k-tile, store the prefetched data from registers into the other smem buffer
            sts128(a_ldg_reg[0], a_ldg_reg[1], a_ldg_reg[2], a_ldg_reg[3],
                   a_sts_addr);
            #pragma unroll
            for (int i = 0; i < 4; ++i) {
                sts32(b_ldg_reg[i], b_sts_addr + i * 32 * sizeof(float));
            }
            __syncthreads();
            from_a += 8;
            from_b += 8 * n;
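            // flip to the other shared memory buffer by toggling one bit of the base addresses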
            aptr_base ^= 0x2000;
            bptr_base ^= 0x1000;
            a_sts_addr ^= 0x2000;
            b_sts_addr ^= 0x1000;
        }
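        // prefetch the (subk + 1)-th fragments of A and B from smem into the other register buffer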
        const int pp = (subk + 1) % 2; // ping-pong index
        lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
               aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
        lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
               aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
        lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
               bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
        lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
               bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));
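        // on the first sub-iteration, start loading the next k-tile from gmem into registers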
        if (0 == subk and loop < k - 8) {
            #pragma unroll
            for (int i = 0; i < 4; ++i) {
                ldg32_nc_0(a_ldg_reg[i],
                           (const char *)(a + from_a) + i * k * sizeof(float));
            }
            // load the next B tile from gmem into registers (it is stored to smem at subk == 7)
            #pragma unroll
            for (int i = 0; i < 4; ++i) {
                ldg32_nc_0(b_ldg_reg[i],
                           (const char *)(b + from_b) + i * 32 * sizeof(float));
            }
        }
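        // 8x8 outer-product FFMAs on the register fragments selected by the parity of subk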
        #pragma unroll
        for (int i = 0; i < 8; ++i) {
            #pragma unroll
            for (int j = 0; j < 8; ++j) {
                sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
            }
        }
    }
}
I ask because the examples I have found target the Ampere architecture and use quite complicated notation… such as cuda::memcpy_async and so on… which is different from the code I pasted above.