A severe limitation of current GPUDirect RDMA is the lack of sub-kernel memory ordering with GPUDirect RDMA peers. Specifically, memory ordering between a peer device and a running GPU kernel is only enforced at kernel boundaries. This is described in section 2.7 of the GPUDirect RDMA documentation.
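To make the pattern concrete, here is a minimal sketch (my own illustration, not from the documentation or the papers below) of the in-kernel consumer pattern this limitation affects. The buffer names `ready` and `payload` are hypothetical GPU allocations registered for GPUDirect RDMA, into which the NIC writes the payload first and then the flag:

```cuda
// sketch.cu -- illustrative only
#include <cuda_runtime.h>

__global__ void consume(volatile unsigned int *ready,
                        const volatile float *payload,
                        float *out, int n)
{
    // Spin until the peer (RDMA NIC) sets the ready flag in GPU memory.
    while (*ready == 0) { }

    // This fence orders this GPU's own memory accesses; it does not
    // establish ordering with the peer's PCIe writes mid-kernel.
    __threadfence_system();

    // Risk: even though ready == 1 was observed, the payload loads may
    // still return stale data, because peer/GPU ordering is only
    // guaranteed at kernel boundaries.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = payload[i];
}
```

As I understand it, the only portable way to guarantee the kernel sees the NIC's payload is to end the kernel and relaunch after the completion is observed on the host, which is exactly the overhead the GPU-centric approaches below try to avoid.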
Several papers, for example:
- GPUrdma (Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers)
- GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

discuss fine-grained, GPU-centric peer-communication implementation issues driven by this limitation.
Is there any update to this limitation that I’ve missed, or are there plans to address it in future hardware?