Request for clarification on TMA docs

The NVIDIA docs state:

"The primary difference between the one-dimensional and multi-dimensional case is that a tensor map must be created on the host and passed to the CUDA kernel."

However, PTX ISA 8.3 adds the tensormap.replace instruction, which lets you modify a tensormap that resides in global (or shared) memory.
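
For concreteness, this is roughly what I picture a single field update looking like from CUDA inline PTX. This is entirely my own sketch based on my reading of the ISA, not something from the docs; it assumes -arch=sm_90a, CUDA 12.3+, and that desc points at a 128-byte-aligned tensormap in global memory, and the helper names are mine:

```cuda
#include <stdint.h>

// Hypothetical helpers: patch single fields of a tensormap that lives in
// global memory. Requires sm_90a and PTX ISA 8.3 (CUDA 12.3+).
__device__ void replace_global_address(void* desc, void* new_base)
{
    asm volatile(
        "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;"
        :: "l"(reinterpret_cast<uint64_t>(desc)),
           "l"(reinterpret_cast<uint64_t>(new_base))
        : "memory");
}

__device__ void replace_global_dim0(void* desc, uint32_t new_extent)
{
    // Per-dimension fields take an extra ordinal operand (dimension 0 here);
    // the value is meant as the new extent of that dimension, as I read the ISA.
    asm volatile(
        "tensormap.replace.tile.global_dim.global.b1024.b32 [%0], 0, %1;"
        :: "l"(reinterpret_cast<uint64_t>(desc)), "r"(new_extent)
        : "memory");
}
```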

So it seems like it should actually be possible to create a multi-dimensional tensormap on the device? The pattern I have in mind is (rough code sketch after the list):

  • Start with a zero-initialized, 128-byte tensormap buffer in global memory
  • Use tensormap.replace to fill in the descriptor fields
  • Use tensormap.cp_fenceproxy to publish the descriptor to the tensormap proxy
  • Have the consuming thread issue fence.proxy.tensormap::generic.acquire.gpu
  • Use the descriptor for TMA tensor loads or stores
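
Concretely, the way I read steps 1-5 is something like the sketch below. This is entirely my own guess pieced together from the PTX ISA examples, so the details may well be wrong: it assumes sm_90a / CUDA 12.3+, a 128-byte-aligned CUtensorMap buffer in global memory, and only patches the global_address field as a placeholder for the full set of replace calls.

```cuda
#include <cuda.h>      // CUtensorMap
#include <stdint.h>

__device__ void build_and_publish_tmap(CUtensorMap* gmem_tmap, void* tensor_base)
{
    // Stage the descriptor in shared memory so its fields can be patched
    // before it is published to the global buffer.
    __shared__ alignas(128) CUtensorMap smem_tmap;

    uint32_t smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(&smem_tmap));

    if (threadIdx.x == 0) {
        // Step 1: start from the zero-initialized global buffer.
        smem_tmap = *gmem_tmap;
        // Step 2: patch descriptor fields in place.
        asm volatile(
            "tensormap.replace.tile.global_address.shared::cta.b1024.b64 [%0], %1;"
            :: "r"(smem_addr), "l"(reinterpret_cast<uint64_t>(tensor_base))
            : "memory");
        // ...more tensormap.replace calls for rank, global_dim, box_dim, etc.
    }
    __syncthreads();

    if (threadIdx.x < 32) {
        // Step 3: copy to global memory and release to the tensormap proxy.
        // .sync.aligned => the whole warp executes this together.
        asm volatile(
            "tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic"
            ".release.gpu.sync.aligned [%0], [%1], 128;"
            :: "l"(reinterpret_cast<uint64_t>(gmem_tmap)), "r"(smem_addr)
            : "memory");
    }
    __syncthreads();

    // Step 4: on the consuming side (once it has observed the release above),
    // acquire the descriptor through the tensormap proxy...
    asm volatile(
        "fence.proxy.tensormap::generic.acquire.gpu [%0], 128;"
        :: "l"(reinterpret_cast<uint64_t>(gmem_tmap))
        : "memory");
    // Step 5: ...and use gmem_tmap as the tensor-map operand of
    // cp.async.bulk.tensor.* loads/stores.
}
```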

Is this pattern safe and correct, or did I totally misunderstand the docs? And are there any performance trade-offs compared to creating the descriptor on the host and passing it in through constant memory?
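
For reference, the host-side baseline I'm comparing against is the usual cuTensorMapEncodeTiled + __grid_constant__ kernel-parameter path, roughly like this (placeholder sizes and data type; assumes a current CUDA context, CUDA 12.x, and linking against the driver API; error checking omitted):

```cuda
#include <cuda.h>           // CUtensorMap, cuTensorMapEncodeTiled (-lcuda)
#include <cuda_runtime.h>

__global__ void consume(const __grid_constant__ CUtensorMap tmap)
{
    // ...cp.async.bulk.tensor.* loads/stores using &tmap...
}

// Encode a 2-D FP32 tensor map on the host and pass it by value to the kernel.
void launch(float* d_tensor, cudaStream_t stream)
{
    CUtensorMap tmap{};
    cuuint64_t global_dim[2]    = {1024, 1024};              // extents in elements
    cuuint64_t global_stride[1] = {1024 * sizeof(float)};    // bytes, dims 1..rank-1
    cuuint32_t box_dim[2]       = {64, 64};                  // tile per TMA transfer
    cuuint32_t elem_stride[2]   = {1, 1};

    cuTensorMapEncodeTiled(&tmap,
                           CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2,
                           d_tensor,
                           global_dim,
                           global_stride,
                           box_dim,
                           elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);

    // The descriptor travels as a kernel parameter, i.e. via constant memory.
    consume<<<1, 128, 0, stream>>>(tmap);
}
```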