GPUDirect RDMA/Async for DL Acceleration (MPI/NCCL)

I am looking for information regarding fully supporting GPUDirect on the NIC’s side.

I was able to DMA between a SmartNIC and a V100 GPU based on the gdrcopy example, and I can read/write SmartNIC memory from within a CUDA kernel after mapping it with cuMemHostRegister() and cuMemHostGetDevicePointer(). cuStreamWriteValue32() also works.
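For context, here is a minimal sketch of the mapping I'm describing (the names `nic_bar_ptr`, `map_nic_memory`, and `ring_doorbell` are mine, and error handling is abbreviated; depending on the CUDA version, registering I/O memory such as a NIC BAR may require the CU_MEMHOSTREGISTER_IOMEMORY flag instead of/in addition to DEVICEMAP):

```cuda
#include <cuda.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #call, (int)r); return -1; } } while (0)

/* Register a NIC memory mapping as page-locked host memory so a CUDA
 * kernel can dereference it directly through a device pointer. */
int map_nic_memory(void *nic_bar_ptr, size_t bytes, CUdeviceptr *dptr_out)
{
    CHECK(cuMemHostRegister(nic_bar_ptr, bytes, CU_MEMHOSTREGISTER_DEVICEMAP));
    CHECK(cuMemHostGetDevicePointer(dptr_out, nic_bar_ptr, 0));
    return 0;
}

/* Have a stream write a 32-bit value (e.g. a doorbell) into the mapped
 * region once preceding work on that stream has completed. */
int ring_doorbell(CUstream stream, CUdeviceptr doorbell)
{
    CHECK(cuStreamWriteValue32(stream, doorbell, 1,
                               CU_STREAM_WRITE_VALUE_DEFAULT));
    return 0;
}
```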

However, to make this useful and more general, it should work transparently with things like MPI and NCCL. I think it should make use of the gdsync library, but this is where I need more information and guidance. Most resources I've found explain CUDA-aware MPI and NCCL from a user's point of view, but I haven't yet found information about what needs to be implemented on the NIC's side.

I'm trying to look through the GDAsync repo on GitHub, but I still get lost between what is GPU/host-side code and which parts are written specifically for the NIC, exposing its RX/TX queues to the application running on top of it. There seems to be a template class that needs to be implemented for the NIC, but there are also device-side send/wait/sync functions whose implementations I can't find.

Can anyone please point me to documentation or give advice on how to approach this?