I am having trouble setting up GPU Direct on my local machines. Here is the local software and hardware:
- GPU Tesla P100-SXM2
- Adapter (MLNX):
5e:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
5e:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
- Cuda compilation tools, release 10.1, V10.1.243
- Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-167-generic x86_64)
I tested the RDMA connection with ibping, and it works fine:
--- anton-j0.(none) (Lid 2) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 1030 ms
rtt min/avg/max = 0.005/0.103/900.020 ms
However, when I tried to get GPU Direct RDMA running, nv_peer_mem wouldn’t install. The GitHub demo also indicates that it requires ConnectX-5 or later to work.
I looked for other approaches that are compatible with ConnectX-4 but haven’t found anything useful yet. I checked the forum, and someone there got a ConnectX-3 Pro to work with GPU Direct RDMA. Could someone give me some guidance on getting GPU Direct RDMA working on ConnectX-4?