GPUDirect RDMA support with CUDA 5


I have several applications where data transfer is a major issue and am interested in trying to get GPUDirect RDMA to work. Consider a cluster of nodes connected via 10 GigE (can choose adapter if that helps) hosting Tesla K20s and running CUDA 5.

Thus far, almost all of the information that I have found regarding GPUDirect already working is via Infiniband adapters. I have done quite a bit of reading on RDMA over Ethernet, but am confused as to its current status. From what I can tell, development is in progress and it is basically not yet supported by anyone.

So, is anyone aware of a 10 GigE adapter that currently supports GPUDirect? Also, is support only required on the receiving node (e.g., in the case that data is streaming over 10 GigE from an FPGA onto a host with a 10 GigE NIC and a GPU)?


Let me update the above question a bit. Based on my reading of the GPUDirect RDMA documentation, it sounds as though an Ethernet NIC should theoretically be able to support zero-copy transfers to GPU memory, given a driver update, as long as the NIC and the GPU share the same upstream root complex.

Is that interpretation correct? If so, is anyone aware of any ongoing work to enable GPUDirect support for any Ethernet NIC?


Hi Thomas,
did you get any more info (privately)?
In “CUDA 5 – Everything You Need to Know” of Oct 24 2012 they said that RDMA only worked on Linux so far and you need to modify the kernel…

I have the exact same situation. I am not interested in using Infiniband. Looking into the GPUDirect technology, it doesn’t seem like the GPU should care whether it is an Infiniband card or an Ethernet card, since it is just a DMA operation, yet I only see Infiniband support for it. I am thinking there is nothing stopping an Ethernet NIC from doing the same thing, just that no one may have implemented it yet. If there is any more information on this, please let me know.


Thanks for the input, James!
Well, actually I am interested in Infiniband. Guess we’d possibly need it for the high end, but some potential choice in the middle ground wouldn’t hurt.

And what worries (and p*****) me is the deafening silence here as soon as it comes to quite special questions.
I’ll try the mail address from the GPUDirect page.

I can only speculate the silence is due to the minimal overlap between CUDA developers and Linux kernel/driver developers. They are largely separated by the fact that the former have Nvidia’s closed source driver (which is needed for CUDA) installed on their systems while the latter use the open source Nouveau driver.

Hi all,

I have gotten some additional information. There was quite a bit of discussion regarding GPUDirect RDMA at the GPU Technology Conference in March. In short, other PCIe devices (e.g., 10GbE NICs) can support GPUDirect RDMA, but someone has to write a driver which does so.

There was one talk where the presenter had written a driver for a Myricom 10GbE card, presumably starting with an open source driver for that NIC. I believe it was session S3300. NVIDIA should post the slides and video of the talks soon (about a month after the conference). Unfortunately, I have not heard of any 10GbE vendor who plans on releasing/supporting a driver that supports GPUDirect RDMA, but hopefully that will happen.

There were some other interesting GPUDirect RDMA talks as well (e.g., S3266 and S3504).

One limitation to be aware of is that the devices must share the same PCIe root complex. In the case of dual socket Xeon Sandy Bridge servers, the CPUs each host their own PCIe root complex and the PCIe slots will typically be associated with one or the other. You cannot use GPUDirect to transfer data from a device hosted by socket A to a device hosted by socket B (i.e., you cannot cross QPI with GPUDirect RDMA – the same holds for GPU-to-GPU transfers using GPUDirect 2.0).
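On Linux, one rough way to check the same-root-complex condition described above is to compare the PCI root bus at the start of each device's resolved sysfs path. This is only a sketch: the helper names and the example paths are made up for illustration, and treating "same root bus prefix" as "same root complex" is a simplifying assumption that may not hold on every platform.

```python
import re

# Matches the root bus component of a resolved sysfs device path,
# e.g. "pci0000:00" in "/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0".
_ROOT_RE = re.compile(r"/sys/devices/(pci[0-9a-f]{4}:[0-9a-f]{2})/")


def pci_root(sysfs_path):
    """Return the root bus component (hypothetical helper, not a library call)."""
    match = _ROOT_RE.match(sysfs_path)
    if match is None:
        raise ValueError(f"not a resolved PCI sysfs path: {sysfs_path}")
    return match.group(1)


def share_root_complex(path_a, path_b):
    """Assumption: devices under the same root bus sit behind the same root complex."""
    return pci_root(path_a) == pci_root(path_b)


# Made-up example paths for a dual-socket box (socket A = bus 00, socket B = bus 80).
GPU0 = "/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0"
NIC0 = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0"
GPU1 = "/sys/devices/pci0000:80/0000:80:02.0/0000:83:00.0"

print(share_root_complex(GPU0, NIC0))  # same socket in this made-up layout
print(share_root_complex(GPU1, NIC0))  # would have to cross QPI
```

On a real system the paths would come from something like `os.path.realpath("/sys/bus/pci/devices/0000:03:00.0")`, with the actual bus addresses taken from `lspci`.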


Thanks a ton, Thomas, for the great info (from GTC)!

Guess we’ll need to live with this PCIe root complex thing for the time being.

I checked a Supermicro 1027GR TQF 1U dual Xeon superserver which can host 4 PCIe 3.0 x16 accelerators.

You must populate both CPU sockets, because one CPU has 40 PCIe lanes and so can only reasonably connect to 2 accelerators in 2 PCIe 3.0 x16 slots.

Question is how you are going to handle the situation. Put in two K20s and two Infiniband adapters if you want to put it in a cluster?

Let’s keep this thread alive and get to REAL ANSWERS!
Thanks again


Mellanox has already announced that they will be releasing GPUDirect RDMA support, so transfers from the Infiniband adapter to devices sharing a root complex should work soon.

As for the dual socket configurations, the CPUs will need to be worked around if you want all of the devices under the same PCIe root. Cirrascale has an 8 GPU blade server which accomplishes this by adding a pair of 80-lane PCIe switch riser cards so that you can daisy-chain those switches from a single CPU. I assume they have 16 PCIe lanes going from the CPU to each 80-lane switch, but I don’t know that for sure. Haswell-based Xeons are reportedly going to have 40 PCIe lanes also (based on rumor sites, so maybe not). Hopefully vendors start adding 80 or 96 lane PCIe switches to systems to support higher accelerator density.
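The lane arithmetic behind this can be sketched in a few lines. The 40 lanes per CPU and the 80-lane switches are the figures quoted in this thread; the 16 upstream lanes per switch is only the assumption stated above, and the function names are made up for illustration.

```python
def x16_slots(lanes, slot_width=16):
    """Number of full-width slots that fit into a lane budget."""
    return lanes // slot_width


def gpus_behind_switch(switch_lanes, upstream_lanes, slot_width=16):
    """x16 devices a switch can host after reserving its uplink to the CPU."""
    return (switch_lanes - upstream_lanes) // slot_width


CPU_LANES = 40          # per socket, figure quoted in this thread
SWITCH_LANES = 80       # per riser switch, figure quoted in this thread
UPSTREAM_LANES = 16     # ASSUMPTION: x16 uplink from the CPU to each switch

direct_slots = x16_slots(CPU_LANES)
gpus_per_switch = gpus_behind_switch(SWITCH_LANES, UPSTREAM_LANES)
total_gpus = 2 * gpus_per_switch
uplink_lanes_used = 2 * UPSTREAM_LANES

print(f"x16 slots directly on one CPU: {direct_slots}")
print(f"x16 GPUs behind one switch:    {gpus_per_switch}")
print(f"x16 GPUs behind two switches:  {total_gpus}")
print(f"CPU lanes used for uplinks:    {uplink_lanes_used} of {CPU_LANES}")
```

Under that assumption the numbers are at least consistent with an 8 GPU box hanging off a single CPU, at the price of all GPUs behind a switch sharing one x16 uplink for traffic to the host.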


Thanks, Thomas,
again great info!
Do you have any idea how transparent these PCIe switches are / will be to drivers, what the switch will add to latency / timing, and whether any Intel chipset will include such a switch to accommodate two / four CPUs?

Hi everyone,
I’ve recently been interested in GPUDirect technology over InfiniBand networks. I’ve read many articles about GPUDirect, but I’m a bit confused about doing RDMA on GPUs’ memory across different nodes without involving the host CPU (I refer to it as GPUDirect v3.0).

It is stated that this particular feature is available in CUDA 5.0. Also, as far as I know, this feature is strictly tied to NVIDIA’s Kepler architecture. Is that right?

In addition, I need to clarify some points…

  1. Is it actually possible to do RDMA on GPUs’ memory across different nodes in an InfiniBand network, without involving the host CPU in the transfer?

  2. Which InfiniBand adapter is able to handle this kind of transfer?

  3. Some MPI libraries, like MVAPICH2, can transfer data across nodes using the most efficient method available, based on where the buffer to be transferred is stored… Is this intended only for GPUDirect 1.0 (involving the CPU) or for 3.0 too (bypassing the CPU)?

Thank you!

Ok, I finally got some useful info.
According to an article I read, it seems that the Mellanox ConnectX-3 adapters are able to handle GPUDirect 3.0 transfers.

Reading the ConnectX-3 adapter manual, I’ve also found a list of requirements for making GPUDirect work, but it is not specified whether this setup is the one needed for GPUDirect 1.0 or 3.0; they simply call it the “GPU-to-GPU method”. Can anyone explain to me whether it refers to the GPUDirect 1.0 or the 3.0 method?

You are looking at GPUDirect 1.0 support. Mellanox does not yet support GPUDirect 3.0 (a.k.a. GPUDirect RDMA), but they have announced that they will support it. They will first need to release a version of OFED with such support. I have not seen any other NIC vendors announce that they will support GPUDirect RDMA, but perhaps such announcements will start to emerge soon.

Note that what NVIDIA provided is the ability for other vendors to add support for GPUDirect RDMA. You can see the details in the pdf in post #2 above. NVIDIA cannot provide such support themselves, unless perhaps the vendor drivers are open source and NVIDIA adds the support themselves.

Hi all,
More than a week ago I tried to get current info from NVIDIA by writing to the mail address from the GPUDirect page, to no avail.

So I went ahead and forwarded the mail to QLogic and Mellanox (addresses quoted on the GPUDirect page). I got a receipt confirmation and the promise of a 24h follow-up from QLogic, and a super-dry
“It is already integrated as part of CUDA 4.1 or later.”
from Mellanox that angered me, but they just followed up for clarification.

I’ll keep you posted on what I find out!

Thanks a lot for this clarification.

Well, “do it yourself” is a good point to start with. I’ll check that, thank you again!

Hi all,
now that the GTC 2013 sessions have become available, I checked out
S3504 - RDMA for GPUDirect by Mellanox’s Pak Lui,
which yielded the following insights:

Without GPUDirect: memory copy from device to host, then a second copy within host main memory into the Infiniband memory region
GPUDirect 1.0 with CUDA 3: eliminates the extra copy step in main memory
GPUDirect 2 with CUDA 4: enables peer-to-peer communication among NVIDIA devices
GPUDirect RDMA (or 3.0) in Q2 2013 (!!!): enables peer-to-peer communication between NVIDIA and Infiniband devices
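To make the difference between the network paths in the list above easier to see, here is a small sketch that writes each one out as a sequence of hops and counts the host staging buffers in between. The buffer labels are my own wording rather than terms from the talk, and GPUDirect 2 is left out because it covers GPU-to-GPU transfers inside one node rather than the path to the adapter.

```python
# Each path lists the memory regions the data passes through on the sending
# node, from GPU memory to the Infiniband adapter. Labels are illustrative.
PATHS = {
    "no GPUDirect": [
        "GPU memory",
        "host buffer (CUDA)",
        "host buffer (Infiniband memory region)",
        "Infiniband adapter",
    ],
    "GPUDirect 1.0": [
        "GPU memory",
        "shared host buffer",
        "Infiniband adapter",
    ],
    "GPUDirect RDMA (3.0)": [
        "GPU memory",
        "Infiniband adapter",
    ],
}


def host_staging_buffers(path):
    """Count the intermediate host buffers between the two endpoints."""
    return len(path) - 2


for name, path in PATHS.items():
    hops = " -> ".join(path)
    print(f"{name}: {hops} [{host_staging_buffers(path)} host buffer(s)]")
```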

He made a reference to MVAPICH2 (MPI-3 over OpenFabrics-IB, OpenFabrics-iWARP, PSM, uDAPL and TCP/IP) by D.K. Panda at OSU.

Funny other side of the coin: TACC is the biggest Xeon Phi site!
“MVAPICH2 drives 7th ranked Multi-Petaflop TACC Stampede system with 204,900 cores, InfiniBand FDR and Intel MIC”

Preliminary benchmarks (for MPI): One can only see latency and throughput improvements for small messages (but not for large ones)! So apparently one needs to check one’s own needs exactly and whether it’s worth it, especially in our case where it’s not (only) MPI!

Hi, thank you for pointing out this new presentation! I found it very interesting and it shed some light on actual GPUDirect development status.

Anyway, I can’t work out what “Q2 2013” refers to… I imagine it refers to a new Mellanox driver which will be released in 2013… is that right? :)

Yeah, Q2 2013 refers to the Mellanox driver. I’d take this prognosis with a grain of salt… ;-)
This is MPI. I need to check some other pivotal points, but first I have to educate myself on all the moving parts.

Didn’t get anything from QLogic so far. They told me they’d get back in 24h, but after a week I turned to Intel, who own their IB business now. Still waiting.

There is at least one other interesting video from GE,
S3266 - GPUDirect Support for RDMA and Green Multi-GPU Architectures,
but I haven’t had time to watch it so far.


HIGHLY RECOMMENDED - the GE talk: S3266 - GPUDirect Support for RDMA and Green Multi-GPU Architectures

Latency improvements: 16x @ 16KB DMA size, still 5x @ 2048KB

Throughput example was 2GB/s with a gen 1 x8 PCIe FPGA; expect up to 12GB/s on PCIe 3 16 lanes
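As a sanity check on those throughput figures, the theoretical per-direction PCIe bandwidth can be worked out from the per-lane transfer rate and the line encoding. The rates and encodings below are the standard ones for PCIe 1.x to 3.0; the function name is made up, and the result ignores packet and protocol overhead, which is why achievable numbers come out lower.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GEN = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbytes_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GEN[gen]
    return gt_per_s * efficiency * lanes / 8  # bits -> bytes


gen1_x8 = pcie_gbytes_per_s(1, 8)
gen3_x16 = pcie_gbytes_per_s(3, 16)

print(f"PCIe gen 1 x8:  {gen1_x8:.2f} GB/s")
print(f"PCIe gen 3 x16: {gen3_x16:.2f} GB/s")
```

So the 2GB/s of the FPGA example corresponds to the theoretical limit of a gen 1 x8 link, and an expected 12GB/s on 16 lanes of PCIe 3 sits plausibly below the roughly 15.75GB/s ceiling.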

Rule of thumb WAS 1 core per GPU. With RDMA CPU load decreased to 10%!!!
VEEERY interesting consequence: 4:1 GPU:CPU ratio instead of 1:1 and a PCIe switch (currently 96 lanes max) instead of dual-socket for better whole-system performance (GFlops/W) and fully connected peers! Nest switches for more!

Mellanox ConnectX OFED explicitly named for backend and inter-process comm.

Have fun watching!

It seems that Mellanox has released an alpha version of the ConnectX driver that supports RDMA with the K10/K20.
Check it out!