While studying the asynchronous copy operations in the CUDA PTX manual, I noticed that TMA (Tensor Memory Accelerator) loads in im2col mode (cp.async.bulk.tensor.4d.shared::cluster.global.im2col) seem to be rarely used in higher-level code. CUTLASS does provide make_im2col_tma_copy as a wrapper around this PTX interface, but I couldn't find any usage examples for it. Could you explain the benefits of the TMA im2col transfer mode, and how to use it?
CODE:
#include <cstdint>

__device__ void load_4d_im2col(void const* desc_ptr, uint64_t& mbar, void* smem_ptr,
                               int32_t const& coord_c, int32_t const& coord_w,
                               int32_t const& coord_h, int32_t const& coord_n,
                               uint16_t const& offset_w, uint16_t const& offset_h) {
  // Raw address of the CUtensorMap (TMA descriptor).
  uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
  // Convert generic pointers to the 32-bit shared-memory addresses PTX expects.
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  uint32_t mbar_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(&mbar));
  // Coordinates are passed innermost dimension first, matching the dimension order
  // of the tensor map ({c, w, h, n} for NHWC); the two im2col offsets select the
  // position inside the filter window.
  asm volatile(
      "cp.async.bulk.tensor.4d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes"
      " [%0], [%1, {%3, %4, %5, %6}], [%2], {%7, %8};"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(mbar_int_ptr),
        "r"(coord_c), "r"(coord_w), "r"(coord_h), "r"(coord_n),
        "h"(offset_w), "h"(offset_h)
      : "memory");
}
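
For context, my understanding is that desc_ptr points at an im2col CUtensorMap built on the host with the driver API cuTensorMapEncodeIm2col (which, as far as I can tell, is also what make_im2col_tma_copy encodes under the hood). Below is a rough sketch of how I have been creating that descriptor; the NHWC shape, tile sizes, and bounding-box corners are placeholder assumptions on my part, so please correct anything that is off.

HOST CODE (sketch):
#include <cuda.h>   // Driver API: CUtensorMap, cuTensorMapEncodeIm2col

// Hypothetical helper: encodes an im2col TMA descriptor for an fp16 NHWC activation
// tensor with N=8, H=56, W=56, C=64. All values below are illustrative assumptions.
CUtensorMap make_im2col_desc(void* gmem_nhwc_fp16) {
  CUtensorMap desc;
  // Dimensions are listed innermost-first (C, W, H, N); strides are in bytes for
  // dimensions 1..3 and must be multiples of 16.
  cuuint64_t global_dim[4]     = {64, 56, 56, 8};
  cuuint64_t global_strides[3] = {64 * 2, 56 * 64 * 2, 56 * 56 * 64 * 2};
  // Bounding-box corners: my understanding is lower = -padding and
  // upper = padding - (filter - 1) * dilation, so {-1, -1} for a 3x3 window
  // with padding 1 and dilation 1.
  int lower_corner[2] = {-1, -1};   // {W, H}
  int upper_corner[2] = {-1, -1};   // {W, H}
  cuuint32_t element_strides[4] = {1, 1, 1, 1};
  CUresult res = cuTensorMapEncodeIm2col(
      &desc,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/4,
      gmem_nhwc_fp16,
      global_dim,
      global_strides,
      lower_corner,
      upper_corner,
      /*channelsPerPixel=*/64,   // C elements copied per pixel (128 B of fp16)
      /*pixelsPerColumn=*/64,    // number of (n, h, w) pixels per copy
      element_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)res;  // a real setup should check res == CUDA_SUCCESS
  return desc;
}

I then pass the resulting CUtensorMap to the kernel as a const __grid_constant__ parameter and take its address as desc_ptr.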