What is the start marker of the scope containing "async_bulk" in "ptx::cp_async_bulk_commit_group()"?

yuanxd20 · September 11, 2024, 11:16am

Hi,
I read the CUDA C++ Programming Guide and have two questions about the following TMA example from it:

#include <cuda/barrier>
#include <cuda/ptx>
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace ptx = cuda::ptx;

static constexpr size_t buf_len = 1024;
__global__ void add_one_kernel(int* data, size_t offset)
{
  // Shared memory buffer. The destination shared memory buffer of
  // a bulk operations should be 16 byte aligned.
  __shared__ alignas(16) int smem_data[buf_len];

  // 1. a) Initialize shared memory barrier with the number of threads participating in the barrier.
  //    b) Make initialized barrier visible in async proxy.
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;
  if (threadIdx.x == 0) { 
    init(&bar, blockDim.x);                      // a)
    ptx::fence_proxy_async(ptx::space_shared);   // b)
  }
  __syncthreads();

  // 2. Initiate TMA transfer to copy global to shared memory.
  if (threadIdx.x == 0) {
    // 3a. cuda::memcpy_async arrives on the barrier and communicates
    //     how many bytes are expected to come in (the transaction count)
    cuda::memcpy_async(
        smem_data, 
        data + offset, 
        cuda::aligned_size_t<16>(sizeof(smem_data)),
        bar
    );
  }
  // 3b. All threads arrive on the barrier
  barrier::arrival_token token = bar.arrive();
  
  // 3c. Wait for the data to have arrived.
  bar.wait(std::move(token));

  // 4. Add one to each element and write the result back to shared memory
  for (int i = threadIdx.x; i < buf_len; i += blockDim.x) {
    smem_data[i] += 1;
  }

  // 5. Wait for shared memory writes to be visible to TMA engine.
  ptx::fence_proxy_async(ptx::space_shared);   // b)
  __syncthreads();
  // After syncthreads, writes by all threads are visible to TMA engine.

  // 6. Initiate TMA transfer to copy shared memory to global memory
  if (threadIdx.x == 0) {
    ptx::cp_async_bulk(
        ptx::space_global,
        ptx::space_shared,
        data + offset, smem_data, sizeof(smem_data));
    // 7. Wait for TMA transfer to have finished reading shared memory.
    // Create a "bulk async-group" out of the previous bulk copy operation.
    ptx::cp_async_bulk_commit_group();
    // Wait for the group to have completed reading from shared memory.
    ptx::cp_async_bulk_wait_group_read(ptx::n32_t<0>());
  }
}

Questions:
1. Where does the bulk async-group created by “ptx::cp_async_bulk_commit_group();” begin? Is its starting point the “ptx::fence_proxy_async(ptx::space_shared);” in step 5?
2. Since step 5 also calls “__syncthreads();”, is the “ptx::fence_proxy_async(ptx::space_shared);” in step 5 redundant?
