GPUDirect question - cudaDeviceCanAccessPeer information

For RDMA from GPU memory to GPU memory between different hosts through an HCA to work, does cudaDeviceCanAccessPeer() need to return 1?

What are the requirements for cudaDeviceCanAccessPeer to return 1? Is that determined strictly by the physical hardware design of the GPU (a K600 in this case)?
Does the PCI bus layout play into that (dependent on the computer system the GPU is plugged into)?
Or some combination of both?
The host channel adapter in use is a ConnectX-5 VPI.

I can do the RDMA host memory to host memory, even with pinned cudaMallocHost memory… but not with cudaMalloc memory. I’m wondering if this hardware setup is even capable of it, and if that function is how I make that determination programmatically.

Thanks

cudaDeviceCanAccessPeer does not need to return 1. That would imply the existence of a second device in the same machine/node, and GPUDirect RDMA certainly does not depend on having a minimum of 2 GPUs in each node. You only need 1. (Two devices in separate nodes can never be in a CUDA peer relationship, as the CUDA runtime on each node would only have 1 device in view.)

However the GPU in question as well as the network adapter must be on the same PCIE fabric. If they are both enumerated from the same PCIE root complex, that is a sufficient condition to satisfy the requirement of “on the same PCIE fabric” but not a necessary condition. It’s generally also sufficient if they are both connected (via PCIE) to the same CPU socket, however there are some nuances here with certain recent Intel CPUs. And if you’re using an AMD CPU, I wouldn’t have anything to say about that.
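
As a rough way to check the "same root complex" condition on Linux, one option is to compare where the two devices sit in sysfs. This is only a sketch: it treats a shared sysfs root bus (the `pciDDDD:BB` path component) as a proxy for "same root complex", which is an assumption, and the device paths below are made up for illustration rather than taken from any system in this thread.

```python
import os


def pci_root_bus(sysfs_path):
    """Return the root-bus component (e.g. 'pci0000:00') of a resolved sysfs path."""
    for part in sysfs_path.split("/"):
        if part.startswith("pci") and ":" in part:
            return part
    return None


def same_root_bus(path_a, path_b):
    """True when both resolved device paths hang off the same PCI root bus."""
    root_a = pci_root_bus(path_a)
    root_b = pci_root_bus(path_b)
    return root_a is not None and root_a == root_b


def resolve(bdf):
    """Resolve a PCI address such as '0000:03:00.0' to its real sysfs path (Linux only)."""
    return os.path.realpath(os.path.join("/sys/bus/pci/devices", bdf))


# Hypothetical resolved paths, for illustration only.
gpu = "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0"
hca = "/sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0"
other = "/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0"

print(same_root_bus(gpu, hca))
print(same_root_bus(gpu, other))
```

On a real machine you would pass `resolve(...)` the bus addresses that `lspci` reports for the GPU and the HCA. As noted above, sharing a root complex is sufficient but not necessary, so a negative result here does not by itself rule out GPUDirect RDMA.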

However if you had a fabric issue, I don’t think you would be getting to the point you are at where it halfway works (CPU memory in node 1 to GPU memory in node 2). Unless you have the fabric issue on one machine but not the other.

nvidia-smi topo -m

will spell out the relationships, including gpu to network adapter

the output could look something like this:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SYS     SYS     PHB     0-5,12-17
GPU1    PIX      X      SYS     SYS     PHB     0-5,12-17
GPU2    SYS     SYS      X      PHB     SYS     6-11,18-23
GPU3    SYS     SYS     PHB      X      SYS     6-11,18-23
mlx4_0  PHB     PHB     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

SYS indicates that P2P and GPUDirect RDMA transactions cannot follow that path.
PHB (and PIX) indicates that the devices in question are on “the same PCIE fabric” - P2P or GPUDirect RDMA is possible.

So we see that GPUs 0 and 1 in the above diagram can communicate (for purposes of GPUDirect RDMA) with the network adapter (mlx4_0). This also assumes the GPUs in question are Quadro or Tesla GPUs.
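
If you want to apply that reading of the matrix in a script, something like the following could work. It is a minimal sketch that assumes the whitespace-separated layout shown above, and it only classifies the link types this reply actually mentions (SYS, PHB, PIX); anything else is reported as not covered rather than guessed. The function names are just for illustration.

```python
SAME_FABRIC = {"PIX", "PHB"}   # described above as "the same PCIE fabric"
NOT_POSSIBLE = {"SYS"}         # described above as a path P2P/RDMA cannot follow

SAMPLE = """
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SYS     SYS     PHB     0-5,12-17
GPU1    PIX      X      SYS     SYS     PHB     0-5,12-17
GPU2    SYS     SYS      X      PHB     SYS     6-11,18-23
GPU3    SYS     SYS     PHB      X      SYS     6-11,18-23
mlx4_0  PHB     PHB     SYS     SYS      X
"""


def parse_topo(text):
    """Parse the matrix part of `nvidia-smi topo -m` output into {(row, col): link}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Device columns end where the "CPU Affinity" column begins.
    devices = header[:header.index("CPU")] if "CPU" in header else header
    links = {}
    for ln in lines[1:]:
        fields = ln.split()
        if fields[0] not in devices:
            break  # reached the legend or other trailing text
        for col, link in zip(devices, fields[1:1 + len(devices)]):
            links[(fields[0], col)] = link
    return links


def classify(link):
    """Map a link type to the interpretation given in this thread."""
    if link in SAME_FABRIC:
        return "possible"
    if link in NOT_POSSIBLE:
        return "not possible"
    return "not covered here"


links = parse_topo(SAMPLE)
for gpu in ("GPU0", "GPU1", "GPU2", "GPU3"):
    print(gpu, classify(links[(gpu, "mlx4_0")]))
```

Keep in mind this only answers the topology question. It says nothing about whether a particular GPU model supports GPUDirect RDMA in the first place.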

Thanks for the info.
I’m thinking it’s not a fabric issue, at least not the HCA fabric (Eth in this case), as I can switch ends and see the same behavior. It does seem to pass host memory to client GPU memory between the two systems, but I just cannot seem to source from the GPU on the host.
I don’t think it’s the RDMA code, as I can use cudaMallocHost instead of cudaMalloc and it works.
I also can’t get it to work with perftest with the option --cuda-rdma enabled. I just don’t know why.
I’ll have to dig into the PCI layout I guess. I tried making sense of the lspci output earlier and noted that the documentation is somewhat lacking.
I would assume that the HCA and the GPU would need to be on the same root complex, just the same as two GPUs would.
There is only one CPU, but it’s a quad core… these computers are also a bit dated.
As a newbie question, can the host/CPU side access the memory pointer returned by cudaMalloc? The RDMA pinning is done on the host side, not in a kernel.

Just read your edit…
I have a VPI card and have two ports defined. One is Ethernet (RoCE) and the other Infiniband.
For the Ethernet adapter, the GPU-to-device relationship is PHB.
For the IB adapter, the GPU-to-device relationship is PIX.
The GPU’s CPU affinity is 0-3.

So the ethernet is traversing a host bridge (described as typically the cpu).
I suspect if I ran this via IB, it might work.
I also wonder whether, if I swap the port configurations, the PIX/PHB relationship would follow the Ethernet port.

I’ll let you know.

thanks much

Using IB (IPoIB) vs EN didn’t change anything. ConnectX-5 VPI cards. K600 GPUs.
I posted a modified version of “the geek in the corner” example over on the community.mellanox.com site.
I get a protection fault (4) as the write completion status for an RDMA_WRITE from GPU memory to GPU memory.
The perftool works… so going by that the hardware is capable of it.
example:
server:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits -R --use_cuda
client:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits 15.15.15.5 -R --use_cuda

Since the perftool says it’s working, I suspect it’s me, but I’m not seeing it. I’ve tried doing the allocation the same way as far as I can figure out from the perftool, but I still get the same error, even in that modified example.
This is gpu to gpu between two systems.
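
For anyone reading along, the "(4)" above is presumably the numeric `ibv_wc_status` from the work completion. A small lookup like the one below makes such codes easier to read in logs. It is only a sketch: the values mirror the first few entries of `enum ibv_wc_status` in `<infiniband/verbs.h>`, and the `describe` helper is a made-up name, not part of any library.

```python
from enum import IntEnum


class WcStatus(IntEnum):
    """First few values of `enum ibv_wc_status` from <infiniband/verbs.h>."""
    IBV_WC_SUCCESS = 0
    IBV_WC_LOC_LEN_ERR = 1
    IBV_WC_LOC_QP_OP_ERR = 2
    IBV_WC_LOC_EEC_OP_ERR = 3
    IBV_WC_LOC_PROT_ERR = 4
    IBV_WC_WR_FLUSH_ERR = 5


def describe(status):
    """Return the symbolic name for a completion status, if it is one listed above."""
    try:
        return WcStatus(status).name
    except ValueError:
        return f"unknown status {status}"


print(describe(4))
```

Status 4 is the local protection error, which generally means the buffer posted in the work request is not covered by a valid memory registration on the sending side. In C, `ibv_wc_status_str()` gives a readable string for the same value.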

“community.mellanox.com” under software and drivers, “RDMA GPUDirect//nvidia-peer-memory/cuda issue”

cuda 10.1
nv_peer_memory (nv_peer_memory_master 1.0.8) (Mellanox OFED GPUDirect)
Mellanox OFED 4.6-1.0.1.1
Almost all the system information is posted there too.

I’m still not getting this to work. Mellanox checked the perftools and my little RDMA sample with success, so it’s not the code. They have a better PC and GPU. Mine are pretty archaic in the scheme of things, though from looking them up, this should work.
Is there something I can check programmatically (a flag somewhere) that would indicate whether or not a GPU actually supports GPUDirect?

The nvidia-smi topo -m results follow…
Is it because the routing is not all PIX?
Your example shows two GPUs in one box. Can this work to the same GPU?
Though I’m seeing the same error between two like hosts.

[mpiuser@localhost rdmaX]$ nvidia-smi topo -m

	GPU0	mlx5_0	mlx5_1	CPU Affinity
GPU0	 X 	PHB	PHB	0-3
mlx5_0	PHB	 X 	PIX	
mlx5_1	PHB	PIX	 X 	

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
[mpiuser@localhost rdmaX]$

Thanks

The end result of this is that the K600 Quadro does not support GPUDirect (RDMA).
The perftools have to be compiled to support RDMA; otherwise they don’t complain if you try to use that flag. The perftools would not work with the RDMA flag set for the K600.

There is a list here ( https://developer.nvidia.com/gpudirectforvideo ) of supported GPUs, but it is a bit dated. Nobody at NVIDIA seems to know if their newer products support GPUDirect or not. If it’s not on this list, I would not assume it is supported.

Thanks for the help though.

The list you pointed to doesn’t look dated as it lists the Quadro RTX 8000, 6000, 5000, 4000, all of which were released in 2018. There are no newer Quadro models.

Maybe, but there are newer GPU cards, e.g. the Titan family, that have been released.
I would have hoped that the newer products wouldn’t have dropped features.
I also would have hoped that nvidia would be better versed on what their own products support.
That list above is still the best answer I’ve gotten.

“Newer” doesn’t mean anything. Like many other hardware manufacturers (compare the automotive industry), NVIDIA uses market segmentation as part of their sales strategy, including the use of different brand names (e.g. GeForce, Quadro, Tesla, Jetson) for different product lines. In such an approach, particular features are often assigned to / reserved for certain market segments.

Roughly speaking, we observe that GPUDirect support seems to be a feature limited to the high-end professional market segment. Beyond market segmentation, an additional motivation for the restriction might be (speculation!) that the cost of developing, maintaining, and supporting GPUDirect can best be recouped through those (presumably higher margin) products. Spreading support to all CUDA-capable GPUs would drive up cost without a commensurate increase in sales revenue, as a consequence lowering profits.

I do not see anything that looks erroneous in the list you pointed to. It seems complete and up-to-date.