While studying the asynchronous copy operations in the CUDA PTX manual, I noticed that TMA (Tensor Memory Accelerator) loads in im2col mode (cp.async.bulk.tensor.4d.shared::cluster.global.im2col) seem to be rarely used in higher-level code. CUTLASS does provide make_im2col_tma_copy as a wrapper around this PTX interface, but I couldn't find any usage examples for it. Could you explain the benefits of the TMA im2col transfer mode, and how to use it?
CODE:
#include <cstdint>

__device__ void load_4d_im2col(void const* desc_ptr, uint64_t& mbar, void* smem_ptr,
                               int32_t const& coord_c, int32_t const& coord_w,
                               int32_t const& coord_h, int32_t const& coord_n,
                               uint16_t const& offset_w, uint16_t const& offset_h) {
  // Raw address of the CUtensorMap (TMA descriptor).
  uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
  // Convert generic pointers to the 32-bit shared-memory addresses PTX expects.
  uint32_t smem_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  uint32_t mbar_int_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(&mbar));
  // Coordinates are passed innermost dimension first, matching the dimension order
  // of the tensor map ({c, w, h, n} for NHWC); the two im2col offsets select the
  // position inside the filter window.
  asm volatile(
      "cp.async.bulk.tensor.4d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes"
      " [%0], [%1, {%3, %4, %5, %6}], [%2], {%7, %8};"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(mbar_int_ptr),
        "r"(coord_c), "r"(coord_w), "r"(coord_h), "r"(coord_n),
        "h"(offset_w), "h"(offset_h)
      : "memory");
}
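
For context, my understanding is that desc_ptr points at an im2col CUtensorMap built on the host with the driver API cuTensorMapEncodeIm2col (which, as far as I can tell, is also what make_im2col_tma_copy encodes under the hood). Below is a rough sketch of how I have been creating that descriptor; the NHWC shape, tile sizes, and bounding-box corners are placeholder assumptions on my part, so please correct anything that is off.

HOST CODE (sketch):
#include <cuda.h>   // Driver API: CUtensorMap, cuTensorMapEncodeIm2col

// Hypothetical helper: encodes an im2col TMA descriptor for an fp16 NHWC activation
// tensor with N=8, H=56, W=56, C=64. All values below are illustrative assumptions.
CUtensorMap make_im2col_desc(void* gmem_nhwc_fp16) {
  CUtensorMap desc;
  // Dimensions are listed innermost-first (C, W, H, N); strides are in bytes for
  // dimensions 1..3 and must be multiples of 16.
  cuuint64_t global_dim[4]     = {64, 56, 56, 8};
  cuuint64_t global_strides[3] = {64 * 2, 56 * 64 * 2, 56 * 56 * 64 * 2};
  // Bounding-box corners: my understanding is lower = -padding and
  // upper = padding - (filter - 1) * dilation, so {-1, -1} for a 3x3 window
  // with padding 1 and dilation 1.
  int lower_corner[2] = {-1, -1};   // {W, H}
  int upper_corner[2] = {-1, -1};   // {W, H}
  cuuint32_t element_strides[4] = {1, 1, 1, 1};
  CUresult res = cuTensorMapEncodeIm2col(
      &desc,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/4,
      gmem_nhwc_fp16,
      global_dim,
      global_strides,
      lower_corner,
      upper_corner,
      /*channelsPerPixel=*/64,   // C elements copied per pixel (128 B of fp16)
      /*pixelsPerColumn=*/64,    // number of (n, h, w) pixels per copy
      element_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)res;  // a real setup should check res == CUDA_SUCCESS
  return desc;
}

I then pass the resulting CUtensorMap to the kernel as a const __grid_constant__ parameter and take its address as desc_ptr.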