Looking at the official documentation, there are multiple ways of doing an async memcpy. The pipelined version seems to be natively supported by `memcpy_async` from `cuda/pipeline`, which takes a pipeline instance. However, there seems to be no similar API among the TMA group of APIs.
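For reference, here is a minimal sketch of the pipelined form I mean, written from my reading of the docs rather than copied from them. The kernel name `scale_tiles`, the helper `run_pipeline_demo`, the tile size, and the stage count are all made up for illustration, and I am assuming a build along the lines of `nvcc -std=c++17 -arch=sm_80`.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kStages = 2;    // hypothetical: double buffering
constexpr int kTile   = 256;  // hypothetical: elements per tile == threads per block

// One block streams num_tiles tiles through shared memory and doubles each value.
__global__ void scale_tiles(const int* in, int* out, int num_tiles) {
    auto block = cg::this_thread_block();
    __shared__ int smem[kStages][kTile];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block,
                                           kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    int fetch = 0;
    for (int compute = 0; compute < num_tiles; ++compute) {
        // Producer side: keep up to kStages tile copies in flight.
        for (; fetch < num_tiles && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetch % kStages], in + fetch * kTile,
                               sizeof(int) * kTile, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the oldest committed stage, use it, free it.
        pipe.consumer_wait();
        out[compute * kTile + threadIdx.x] = 2 * smem[compute % kStages][threadIdx.x];
        pipe.consumer_release();
    }
}

// Host helper (hypothetical): returns the number of mismatches, or -1 on a CUDA error.
int run_pipeline_demo() {
    const int num_tiles = 8;
    const int n = num_tiles * kTile;
    const size_t bytes = n * sizeof(int);
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_tiles<<<1, kTile>>>(d_in, d_out, num_tiles);
    // The blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (h_out[i] != 2 * h_in[i]);
    return mismatches;
}
```

Calling `run_pipeline_demo()` from `main` should return 0 if every tile was copied and scaled correctly. The point is only that the pipeline object is passed straight into `memcpy_async`, which is the part I cannot find an equivalent for on the TMA side.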
Is it possible to speed up the TMA async memcpy by using a pipeline with the current interfaces and APIs?