Ping-pong: can accesses to different locations in the same shared memory happen at the same time?

Hi! I am learning the ping-pong (double-buffering) strategy. Something like this:

allocate double-buffered shared memory, e.g. A[2][64][64]

load the first tile into A[0][?][?]

for (  ....   )
    prefetch the next tile into A[1][?][?]    <---------------(1)
    read values from A[0][?][?] and compute   <---------------(2)

So (1) and (2) can actually happen at the same time, right? They both access shared memory at the same time, but they touch different locations, so they should be able to overlap?

Or maybe they access shared memory one at a time, but since writing into shared memory may be slower, only the later part of the write happens at the same time as the calculation?
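
To make the pattern concrete, here is roughly what I mean in real code, shrunk to a 1-D tile of 64 floats per buffer (the kernel and variable names are just made up for this sketch, and the block is assumed to have 64 threads):

__global__ void pingpong_sum(const float *g, float *out, int num_tiles)
{
  __shared__ float A[2][64];               // two buffers: "ping" and "pong"
  const int tid = threadIdx.x;             // assume blockDim.x == 64

  int cur = 0;
  A[cur][tid] = g[tid];                    // load the first tile
  __syncthreads();

  float acc = 0.0f;
  for (int t = 0; t < num_tiles; ++t) {
    const int nxt = cur ^ 1;
    if (t + 1 < num_tiles)
      A[nxt][tid] = g[(t + 1) * 64 + tid]; // (1) prefetch the next tile
    acc += A[cur][tid];                    // (2) compute on the current tile
    __syncthreads();   // (1) must be finished before the other buffer is reused
    cur = nxt;
  }
  out[tid] = acc;
}

My question is whether the store in (1) and the load in (2) really execute concurrently here.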

Thank you!!!

////////////////////////////////////////////////////////////////////////

By the way, I have tried searching on Google, for example this:

But it does not exactly resolve my confusion… I think this question will also benefit later learners ^.^

Whether or not (1) and (2) can happen at the same time depends on the GPU architecture and on the implementation of the prefetch. On Ampere with memcpy_async, it should work.
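
For example, with cooperative_groups::memcpy_async your loop could look roughly like the sketch below (untested, and the names are placeholders). Note that the copy only bypasses registers and becomes truly asynchronous on Ampere (sm_80) and newer; on older GPUs it falls back to staging through registers, as far as I know.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void pingpong_async(const float *g, float *out, int num_tiles)
{
  __shared__ float A[2][64];
  auto block = cg::this_thread_block();
  const unsigned tid = block.thread_rank();   // assume 64 threads per block

  int cur = 0;
  cg::memcpy_async(block, A[cur], g, sizeof(float) * 64); // load the first tile
  cg::wait(block);              // joins the block and waits for the copy

  float acc = 0.0f;
  for (int t = 0; t < num_tiles; ++t) {
    const int nxt = cur ^ 1;
    if (t + 1 < num_tiles)      // (1) asynchronously prefetch the next tile
      cg::memcpy_async(block, A[nxt], g + (t + 1) * 64, sizeof(float) * 64);
    acc += A[cur][tid];         // (2) compute on the current tile meanwhile
    cg::wait(block);            // copy done, and every thread is done reading A[cur]
    cur = nxt;
  }
  out[tid] = acc;
}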


That sounds good! But I am using a Turing GTX 1650, and I hope this kernel can be used on different architectures… Do you know of any examples? I found some code that claims to achieve ping-pong, but I do not know how to test that claim. Would you help me check it?

  for (int loop = 0; loop < k; loop += 8) {
// compute on the current shared-memory tile, one sub-step at a time
#pragma unroll
    for (int subk = 0; subk < 8; ++subk) {

      if (7 == subk and loop < k - 8) {
        // last sub-step: if another K-tile remains, store the prefetched
        // registers into the other shared-memory buffer
        sts128(a_ldg_reg[0], a_ldg_reg[1], a_ldg_reg[2], a_ldg_reg[3],
               a_sts_addr);

#pragma unroll
        for (int i = 0; i < 4; ++i) {
          sts32(b_ldg_reg[i], b_sts_addr + i * 32 * sizeof(float));
        }
        __syncthreads();
        from_a += 8;
        from_b += 8 * n;

        // ping-pong: toggle the shared-memory offsets to switch to the other buffer
        aptr_base ^= 0x2000;
        bptr_base ^= 0x1000;
        a_sts_addr ^= 0x2000;
        b_sts_addr ^= 0x1000;
      }

      // register-level double buffer: prefetch the next sub-step's A/B fragments from smem
      const int pp = (subk + 1) % 2; // ping-pong index
      lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
             aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
      lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
             aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));

      lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
             bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
      lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
             bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));

      // first sub-step: kick off global-memory loads of the next K-tile into registers
      if (0 == subk and loop < k - 8) {
#pragma unroll
        for (int i = 0; i < 4; ++i) {
          ldg32_nc_0(a_ldg_reg[i],
                     (const char *)(a + from_a) + i * k * sizeof(float));
        }
        // load the next B tile from global memory into registers (stored to smem at subk == 7)
#pragma unroll
        for (int i = 0; i < 4; ++i) {
          ldg32_nc_0(b_ldg_reg[i],
                     (const char *)(b + from_b) + i * 32 * sizeof(float));
        }
      }

      // rank-1 update: accumulate the outer product of the current fragments
#pragma unroll
      for (int i = 0; i < 8; ++i) {
#pragma unroll
        for (int j = 0; j < 8; ++j) {
          sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
        }
      }
    }
  }

From: how-to-optimize-gemm/MMult_cuda_12.cu at master · tpoisonooo/how-to-optimize-gemm · GitHub

I ask because the examples I see for this target the Ampere architecture and use fairly complicated notation, such as cuda::memcpy_async and so on, which is quite different from the code I posted above.

Also, is memcpy_async tied to CUDA 11.1 rather than to Ampere specifically? I am confused because I see it written as cuda::memcpy_async…

There are cooperative_groups::memcpy_async and cuda::memcpy_async. The former is described here (with a ping-pong example):

The latter is described here: cuda::memcpy_async - libcu++
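
For completeness, the libcu++ flavor is usually paired with a cuda::barrier (or a cuda::pipeline). Below is a minimal single-buffer sketch with names of my own, just to show the moving parts; it compiles for Turing as well, but the copy is only hardware-asynchronous on Ampere (sm_80) and newer.

#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void tile_copy(const float *g, float *out)
{
  namespace cg = cooperative_groups;
  auto block = cg::this_thread_block();

  __shared__ float tile[64];                              // assume 64 threads per block
  __shared__ cuda::barrier<cuda::thread_scope_block> bar; // tracks copy completion
  if (block.thread_rank() == 0)
    init(&bar, block.size());                             // one arrival per thread
  block.sync();

  // every thread submits the copy; completion is signalled on the barrier
  cuda::memcpy_async(block, tile, g, sizeof(float) * 64, bar);

  bar.arrive_and_wait();                                  // data is now in shared memory

  out[block.thread_rank()] = tile[block.thread_rank()] * 2.0f;
}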