Using cuda::pipeline with TMA

According to the official documentation on asynchronous memcpy, there are multiple ways of doing an async memcpy:

  • pipelining with memcpy_async (doc)
  • using the TMA-related memcpy_async (doc)

The pipelined memcpy_async seems to be natively supported: the memcpy_async overload from cuda/pipeline takes a pipeline instance. However, there seems to be no similar API among the TMA-related APIs.
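
For reference, here is a minimal sketch of the pipelined form I mean, modeled on the multi-stage pattern in the CUDA programming guide. The kernel name `staged_double`, the constants `kStages` and `kThreads`, and the data layout are my own placeholders, and I am assuming an Ampere-or-newer target (e.g. `nvcc -arch=sm_80`). It only shows the non-TMA path; it is not a TMA-based copy.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kStages  = 2;    // assumed pipeline depth (placeholder)
constexpr int kThreads = 128;  // assumed block size (placeholder)

// Each block streams `num_batches` tiles of kThreads ints through shared
// memory, prefetching up to kStages tiles ahead of the tile being consumed.
__global__ void staged_double(const int* in, int* out, int num_batches) {
    auto block = cg::this_thread_block();
    __shared__ int tiles[kStages][kThreads];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);  // every thread produces and consumes

    const int block_base = blockIdx.x * num_batches * kThreads;
    int fetch = 0;
    for (int compute = 0; compute < num_batches; ++compute) {
        // Producer side: keep up to kStages tile copies in flight.
        for (; fetch < num_batches && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, tiles[fetch % kStages],
                               in + block_base + fetch * kThreads,
                               sizeof(int) * kThreads, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the oldest committed tile, use it, free the stage.
        pipe.consumer_wait();
        out[block_base + compute * kThreads + threadIdx.x] =
            2 * tiles[compute % kStages][threadIdx.x];
        pipe.consumer_release();
    }
}

int main() {
    const int blocks = 4, num_batches = 8;
    const int n = blocks * num_batches * kThreads;
    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    staged_double<<<blocks, kThreads>>>(d_in, d_out, num_batches);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Count elements that were not doubled correctly.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (h_out[i] != 2 * h_in[i]);
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

The point is that the `pipeline` object is passed directly to `memcpy_async`, so the copy is tied to a pipeline stage through `producer_commit` / `consumer_wait`. That direct hook is what I cannot find for the TMA variants.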

Is it possible to speed up the TMA async memcpy with a pipeline using the current interfaces and APIs?