The NVIDIA docs state:

> The primary difference between the one-dimensional and multi-dimensional case is that a tensor map must be created on the host and passed to the CUDA kernel.
However, PTX 8.3 adds the `tensormap.replace` instruction, which lets you modify a tensormap that resides in global memory. So it seems like it should actually be possible to create a multi-dimensional tensormap on the device, along these lines:
- Start with zero-initialized global memory
- Use `tensormap.replace` to create the descriptor
- Use `tensormap.cp_fenceproxy` to copy the descriptor into the tensormap proxy
- Insert a fence with `fence.proxy.tensormap::generic.acquire.gpu`
- Use the descriptor to load or store tensors (rough sketch below)
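Concretely, here is a rough, untested sketch of the sequence I have in mind (sm_90a, PTX 8.3). The exact `tensormap.replace` field qualifiers and value encodings are my best reading of the PTX ISA tables, and I've staged the descriptor in shared memory because `tensormap.cp_fenceproxy`, as I read it, copies `shared::cta` to global. Please treat this as pseudocode rather than working code:

```cuda
// Untested sketch of the steps above (sm_90a, PTX 8.3). Field names, value
// encodings, and the warp-uniformity requirements of the .sync.aligned
// qualifiers are my reading of the PTX ISA doc, not verified.
#include <cstdint>

__device__ void build_and_publish_tmap(void* tmap_gmem,   // 128-byte-aligned global buffer
                                       void* tensor_base) // base address of the tensor data
{
    // Stage the descriptor in shared memory, since tensormap.cp_fenceproxy
    // (as I read it) copies shared::cta -> global.
    __shared__ alignas(128) uint8_t tmap_smem[128];
    for (int i = threadIdx.x; i < 128; i += blockDim.x) tmap_smem[i] = 0;
    __syncthreads();

    if (threadIdx.x == 0) {
        uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(tmap_smem));

        // Patch individual fields of the zeroed descriptor. Only two shown here;
        // rank, box_dim, global_dim, global_stride, elemtype, swizzle, etc. would
        // all need to be filled in the same way.
        asm volatile("tensormap.replace.tile.global_address.shared::cta.b1024.b64 [%0], %1;"
                     :: "r"(smem), "l"(tensor_base) : "memory");
        asm volatile("tensormap.replace.tile.rank.shared::cta.b1024.b32 [%0], %1;"
                     :: "r"(smem), "r"(2 - 1) : "memory"); // I think this field stores rank-1?

        // Copy the 128-byte descriptor to global memory and release it to the
        // tensormap proxy in one instruction. (If .sync.aligned requires the whole
        // warp to issue this, it would have to move out of the threadIdx.x == 0 branch.)
        asm volatile("tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic"
                     ".release.gpu.sync.aligned [%0], [%1], 128;"
                     :: "l"(tmap_gmem), "r"(smem) : "memory");
    }
}

// Consumer side, before the first cp.async.bulk.tensor that dereferences tmap_gmem:
__device__ void acquire_tmap(const void* tmap_gmem)
{
    asm volatile("fence.proxy.tensormap::generic.acquire.gpu [%0], 128;"
                 :: "l"(tmap_gmem) : "memory");
}
```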
Is this pattern safe and correct, or did I totally misunderstand the docs? Are there any performance trade-offs compared to creating the descriptor on the host and passing it in through constant memory?
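For reference, the host-side baseline I'd be comparing against is the usual `cuTensorMapEncodeTiled` + `__grid_constant__` kernel-parameter path, roughly like this (shapes, tile sizes, and names are just placeholders):

```cuda
// Host-created descriptor passed by value as a __grid_constant__ parameter,
// so it lands in constant memory. Error handling omitted for brevity.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>

__global__ void consume(const __grid_constant__ CUtensorMap tmap)
{
    // ... cp.async.bulk.tensor.2d using &tmap ...
}

void launch(float* d_tensor, uint64_t rows, uint64_t cols)
{
    CUtensorMap tmap;
    uint64_t global_dim[2]     = {cols, rows};            // fastest-varying dimension first
    uint64_t global_strides[1] = {cols * sizeof(float)};  // strides of dims 1..rank-1, in bytes
                                                          // (must be a multiple of 16)
    uint32_t box_dim[2]        = {64, 64};                // SMEM tile shape (placeholder)
    uint32_t elem_strides[2]   = {1, 1};

    cuTensorMapEncodeTiled(&tmap,
                           CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2,
                           d_tensor,
                           global_dim,
                           global_strides,
                           box_dim,
                           elem_strides,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);

    consume<<<1, 128>>>(tmap);
}
```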