GPUDirect RDMA-enabled GPUs

Hello folks,
I read through the posts here, and the info here: GPUDirect RDMA :: CUDA Toolkit Documentation
It is still difficult to figure out which GPUs enable GPUDirect RDMA, and which don’t.
For example, in GPUDirect RDMA :: CUDA Toolkit Documentation it says:
“GPUDirect RDMA is available on both Tesla and Quadro GPUs.”
Here, does “Tesla” refer to the microarchitecture or to the GPU brand, as in Tesla P100/V100?
In the nv-p2p.h file:
enum {
    NVIDIA_P2P_ARCHITECTURE_TESLA = 0,
    NVIDIA_P2P_ARCHITECTURE_FERMI,
    NVIDIA_P2P_ARCHITECTURE_CURRENT = NVIDIA_P2P_ARCHITECTURE_FERMI
};

This seems to indicate Tesla as the microarchitecture. In that case, what Tesla GPUs are supported?
Please help!
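
As a quick check on the enum quoted in the question, here is a minimal C sketch that copies those three enumerators and names each value. It only illustrates that TESLA is 0 and that CURRENT is an alias for FERMI (both 1), which fits the reading that the header names architecture generations rather than product brands. The helper `p2p_arch_name` is hypothetical and not part of nv-p2p.h, and the sketch does not include the real header.

```c
/* Enumerators copied from the nv-p2p.h excerpt in the question;
 * the real header is NOT included here. */
enum {
    NVIDIA_P2P_ARCHITECTURE_TESLA = 0,
    NVIDIA_P2P_ARCHITECTURE_FERMI,
    NVIDIA_P2P_ARCHITECTURE_CURRENT = NVIDIA_P2P_ARCHITECTURE_FERMI
};

/* Hypothetical helper (not in nv-p2p.h): map an architecture value to a name. */
static const char *p2p_arch_name(int arch)
{
    switch (arch) {
    case NVIDIA_P2P_ARCHITECTURE_TESLA:
        return "TESLA";
    case NVIDIA_P2P_ARCHITECTURE_FERMI:   /* same value as ..._CURRENT */
        return "FERMI";
    default:
        return "UNKNOWN";
    }
}
```

Calling `p2p_arch_name(NVIDIA_P2P_ARCHITECTURE_CURRENT)` returns "FERMI", because the two enumerators share one value.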

It’s referring to the brand. Just like Quadro is a brand.

Does that mean Tesla P100/V100 support GPUDirect RDMA? The available information is too sparse to confirm this.

Best I can tell, the answer is “yes”. The best supporting quotes I can find on NVIDIA’s website are:

https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper-v1.2.pdf
Whitepaper NVIDIA Tesla P100

https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf
NVIDIA DGX-1 With Tesla V100 System Architecture

Robert Crovella may be able to point to better, more meaningful, language.

Note that the designated sales channel for Tesla-brand products is system integrators, not sales directly to end users. These GPUs are intended to be sold and supported as part of a system. System vendors on NVIDIA’s partner list that sell systems with integrated Tesla GPUs should be able to tell you what features are supported by their systems. You are probably aware that you need an RDMA-capable counterpart to the GPU (such as one of various Mellanox adapters) to take advantage of GPUDirect RDMA. There may also be further platform requirements, but this is not my area of expertise and I don’t know the details.

See: https://developer.nvidia.com/gpudirectforvideo

Granted, this is talking specifically about “GpuDirect for Video”, but this also means RDMA does work on these GPUs.

Supported GPUs
Quadro RTX 8000, 6000, 5000, 4000
Quadro GV100, GP100, P6000, P5000, P4000
Quadro M6000, M5000, M4000
Quadro K4000, K4200, K5000, K5200 and K6000
Quadro 4000, 5000, and 6000
Quadro M5000M equipped mobile workstations
Quadro K5100M equipped mobile workstations
GRID K1 and K2
Tesla T4
Tesla V100, P100, K10, K20, K20X, K40
Tesla C2075 and M2070Q

I don’t find a mention of RDMA on the linked page, only copying via pinned host memory. It adds to the confusion that NVIDIA slaps the “GPUDirect” moniker onto less and less direct pathways.

“Curiouser and curiouser!”…How else do you plan on getting data directly into GPU memory? Is there any other connection available on the GPU?

RDMA used to indicate DMA directly from a PCIe device into GPU memory, without host memory involvement.

Now staging via (pinned) host memory in small chunks isn’t necessarily a bad thing (as the buffering decouples the timing of the two PCIe devices and may actually improve throughput).

But this seems to me like it was already possible with just a bit of CUDA programming, without having to wait for any driver improvements.
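
To make the staging pattern described above concrete, here is a host-only C sketch, not real CUDA code. It moves data through two alternating staging buffers in fixed-size chunks. Both hops are plain `memcpy` calls standing in for the real transfers: in an actual pipeline, hop 1 would be the PCIe device writing into pinned host memory and hop 2 an asynchronous host-to-GPU copy, so the two could overlap. The name `staged_copy` and the chunk size are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size; a real pipeline would tune this. */
#define STAGE_CHUNK 4096

/*
 * Copy `len` bytes from `src` to `dst` through two alternating staging
 * buffers and return the number of chunks moved. memcpy stands in for both
 * DMA hops; this sketch runs them back to back rather than overlapping them.
 */
static size_t staged_copy(void *dst, const void *src, size_t len)
{
    static unsigned char stage[2][STAGE_CHUNK];
    size_t done = 0;
    size_t chunks = 0;

    while (done < len) {
        size_t remaining = len - done;
        size_t n = remaining < STAGE_CHUNK ? remaining : STAGE_CHUNK;
        unsigned char *buf = stage[chunks & 1];   /* alternate the two buffers */

        memcpy(buf, (const unsigned char *)src + done, n);  /* hop 1: source -> staging */
        memcpy((unsigned char *)dst + done, buf, n);        /* hop 2: staging -> destination */

        done += n;
        chunks++;
    }
    return chunks;
}
```

Alternating between two buffers is what would let a real implementation fill one buffer while the other is being drained, which is the timing decoupling mentioned above.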

GpuDirect has been around for quite some time: https://developer.nvidia.com/gpudirect

RDMA simply means Remote DMA (Direct Memory Access). It’s not an NVIDIA term. With respect to NVIDIA, as you mentioned, it does mean having the ability to write directly to GPU Memory.

An Aside:

Industry has taken advantage of GpuDirect RDMA, as with the link mentioned before, and there is also a way to move data between GPU memory and an NVMe-capable SSD. I really like this capability, but keep in mind that SSDs have limited write endurance. If you’re pushing GBs of data every second, that SSD isn’t going to last you all that long. But, depending on the needs, a few days may be good enough. It’s expensive, but that’s relative as well.
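
A rough way to put numbers on the SSD lifetime concern is to divide the drive’s rated write endurance by the sustained write rate. The C sketch below does that. The function name `days_to_exhaust` is hypothetical, the inputs are assumed example values rather than figures from any datasheet, and it ignores write amplification and over-provisioning.

```c
/*
 * Days until a drive's rated write endurance is used up at a constant write
 * rate. endurance_tb is in decimal terabytes written, write_gb_per_s in
 * decimal gigabytes per second. Both are hypothetical inputs.
 */
static double days_to_exhaust(double endurance_tb, double write_gb_per_s)
{
    double seconds = (endurance_tb * 1000.0) / write_gb_per_s; /* TB -> GB */
    return seconds / 86400.0;                                  /* s -> days */
}
```

With an assumed endurance rating of 600 TB and a sustained 2 GB/s, `days_to_exhaust(600.0, 2.0)` comes out at roughly 3.5 days, which matches the “a few days” estimate above.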

NVLINK is also a great feature! In the GeForce world it’s quite limited, but move over to Quadros and you just doubled your memory (Unified Memory). One GPU for computations and the other GPU for rendering. Both devices have direct access to the shared memory - no PCIe traffic whatsoever as it’s going over dedicated NVLINK.