CUDA with RDMA

I would like to use RDMA to get data from my RDMA-compatible NIC to my V100 GPU. Can this be accomplished entirely from within CUDA? The data stream originates from a server that does not have a GPU. Do I need to customize the NIC drivers?

Yes, you need “custom” NIC drivers. RDMA capability by itself is not enough. It must be a GPUDirect-customized driver.

The general methodology is discussed here:

https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Ordinary GPUDirect RDMA activity cannot be orchestrated entirely from CUDA device code. It requires host-code interaction. GPUDirect Async is an evolution of GPUDirect RDMA that allows transfers to be triggered by device activity. Searching for these terms together with "GTC" will turn up more information.
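
For illustration, here is a minimal sketch of what that host-code interaction can look like when using the verbs API: allocate device memory with cudaMalloc, then register that pointer with the NIC through ibv_reg_mr. It assumes libibverbs and a GPUDirect-enabled driver stack are installed, including the peer-memory kernel module that lets the NIC driver pin GPU memory. Picking the first RDMA device and the 1 MiB buffer size are arbitrary choices for the example. Queue pair setup and posting receives are left out, so treat this as a starting point rather than a complete receiver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main() {
    // Enumerate the RDMA devices visible to libibverbs.
    int num_devices = 0;
    struct ibv_device** devices = ibv_get_device_list(&num_devices);
    if (devices == nullptr || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        if (devices != nullptr) ibv_free_device_list(devices);
        return 1;
    }

    // Assumption: the first device is the NIC on the same PCIe fabric as the GPU.
    struct ibv_context* ctx = ibv_open_device(devices[0]);
    struct ibv_pd* pd = (ctx != nullptr) ? ibv_alloc_pd(ctx) : nullptr;

    // Allocate the GPU buffer that the NIC should write into (size is illustrative).
    const size_t bytes = 1 << 20;
    void* d_buf = nullptr;
    cudaError_t err = cudaMalloc(&d_buf, bytes);

    // Register the device pointer as an RDMA memory region.
    // This only succeeds for GPU memory when the GPUDirect RDMA driver
    // support is loaded; otherwise ibv_reg_mr returns NULL.
    struct ibv_mr* mr = nullptr;
    if (pd != nullptr && err == cudaSuccess) {
        mr = ibv_reg_mr(pd, d_buf, bytes,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }

    const bool ok = (mr != nullptr);
    if (ok) {
        printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n",
               (unsigned)mr->lkey, (unsigned)mr->rkey);
    } else {
        fprintf(stderr, "could not register GPU memory with the NIC\n");
    }

    // Release resources in reverse order of creation.
    if (mr != nullptr) ibv_dereg_mr(mr);
    if (err == cudaSuccess) cudaFree(d_buf);
    if (pd != nullptr) ibv_dealloc_pd(pd);
    if (ctx != nullptr) ibv_close_device(ctx);
    ibv_free_device_list(devices);
    return ok ? 0 : 1;
}
```

If the registration succeeds, the returned keys are what the rest of a verbs program would use when posting work requests that target the GPU buffer. The program has to be linked against libibverbs and needs an RDMA NIC and a GPU present to do anything useful.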

Thank you for responding. Just to be clear (given my poorly specified post):

  1. Do I need my own “custom” NIC driver even if the card supports “GPUDirect”, not just RDMA (for example, one of the “GPUDirect” Mellanox devices)?

  2. Regardless of whether I need a custom driver, I will still need to implement the necessary host-code interaction, or possibly use GPUDirect Async.

If you have Mellanox ConnectX adapters that already support GPUDirect, then you need nothing else software-wise apart from installing the drivers from Mellanox. For GPUDirect RDMA to work, the NIC must be on the same PCIe fabric as the GPU that you intend to transfer data to or from. This should be possible with many servers qualified for use with Tesla V100, but it is worth checking.

You do need to implement some form of host code to make this work. A typical use-case would be via CUDA-aware MPI (another search term).

https://devblogs.nvidia.com/introduction-cuda-aware-mpi/
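
As a rough sketch of that use case: with a CUDA-aware MPI build, the receiving rank can pass a device pointer straight to MPI_Recv, while the sending rank uses ordinary host memory and makes no CUDA calls. The example below assumes an MPI library built with CUDA support and a launch with exactly two ranks. The `scale` kernel, the buffer size, and the message tag are made up for illustration. Whether the transfer actually goes over GPUDirect RDMA depends on the MPI build and the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative processing step: multiply every element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int n = 1 << 20;  // element count, arbitrary for the example

    if (rank == 0) {
        // Sender: plain host memory, no CUDA calls on this rank.
        std::vector<float> h_buf(n, 1.0f);
        MPI_Send(h_buf.data(), n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else {
        // Receiver: allocate on the GPU and hand the device pointer to MPI.
        // This is only valid when the MPI library is CUDA-aware.
        float* d_buf = nullptr;
        if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Process the received data on the GPU.
        scale<<<(n + 255) / 256, 256>>>(d_buf, n, 2.0f);

        // Copy one element back to confirm the data arrived and was processed.
        float first = 0.0f;
        cudaMemcpy(&first, d_buf, sizeof(float), cudaMemcpyDeviceToHost);
        printf("first element after processing: %f\n", first);

        cudaFree(d_buf);
    }

    MPI_Finalize();
    return 0;
}
```

Note that both sides are still MPI processes here: the MPI_Recv on the GPU rank is matched by an MPI_Send on the other rank.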

Makes sense - thank you.

So I think I have everything configured and installed correctly, and yes, the NIC is indeed on the same PCIe fabric as the GPU. I am able to run some of the benchmarks out there (e.g., ib-verbs).

I’ve been reading up on the CUDA-aware MPI and becoming somewhat smarter.

My fundamental question is: can this be used in configurations that are not server-to-server? All the examples out there appear to be for sending or sharing data in a cluster. I have a single GPU server, and would like to have it receive and process data arriving from an arbitrary server that does not have a GPU; it just streams Ethernet packets. For instance, can you have an MPI_Recv without an MPI_Send?

No, you can’t have an MPI_Recv without an MPI_Send (you probably could write such a program, but the MPI_Recv would never actually receive anything).

OK - so MPI is “out” as a solution for a one-sided (receive-only) application. Would I still be able to have the Mellanox card pass incoming data to bus addresses in the GPU BAR memory via the DMA engine? If so, it’s extremely unclear to me how this is accomplished, or whether I need something more than the Mellanox driver. How would you suggest I bypass the copies to and from the CPU using a Mellanox Converged Ethernet network adapter in my application?