POWER9 GPUDirect poor performance (only 39 Gb/s ConnectX-5 to Tesla)

Dear all

I am a PhD student at the ESRF (esrf.eu, a large synchrotron radiation facility in Grenoble, France), working on high-throughput data transfer from an X-ray image detector to a GPU for online processing.

We have an AC922 workstation (POWER9, ConnectX-5 EN, Tesla V100) and I got the chance to get my hands on it.

I do not understand why I get poor RoCEv2 bandwidth (only 39 Gb/s) when using GPUDirect peer-to-peer transfers.
Other results are as expected:

  • BW is as expected for RDMA transfers to AC922 CPU memory (97 Gb/s).
  • GPUDirect BW is correct on a Quadro P6000/Xeon Dell.
  • Tests are done with Mellanox perftest/ib_write_bw using a UC queue pair, or with a home-made libibverbs application (same results).
  • Throughput inside the machine, host to device and device to host, is correct (see the copy-check sketch after this list).
  • ATS code is working.
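
For completeness, a minimal sketch of the kind of host-to-device copy check meant above, assuming the CUDA runtime API (the file name, the 256 MiB buffer and the 10 iterations are arbitrary illustrative choices, not the exact test behind the figures here):

/* h2d_check.cu - rough host->device copy bandwidth with a pinned buffer */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t len = 256u << 20;          /* 256 MiB */
    const int iters = 10;
    void *h = NULL, *d = NULL;

    cudaMallocHost(&h, len);                /* pinned host buffer */
    cudaMalloc(&d, len);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d, h, len, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.1f GB/s\n", (double)iters * len / (ms * 1e-3) / 1e9);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}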

I suspect it could be related to the ConnectX-5 configuration (shared by mlx5_0 and mlx5_1) or to some CAPI/NVLink tweaks… Can you clarify?

Setup:

  • 3 computers: 1x IBM AC922, 2x Dell R740

One link: R740-1 to R740-2, back to back.

Second link: R740-1 to AC922, back to back.

Same software in all tests: ConnectX-5 RNIC, OFED.

  • The nv_peer_mem and nv_rsync_mem modules are installed, and the nvidia-persistenced service is activated (reboot).

  • results:

AC922 POWER9/Tesla V100 <— Dell R740

94 Gb/s to CPU memory
39 Gb/s to GPU memory (GPUDirect)*

Dell R740/Quadro P6000 <— R740

97 Gb/s to CPU memory
78 Gb/s GPUDirect

Most probably relaxed ordering mode is not enabled for the bifurcated PCIe/CAPI slot hosting the Mellanox (special IBM version) CX-5 card.

You can verify that with nvidia-smi:
nvidia-smi -i 0 -q|grep elax
Relaxed Ordering Mode : Disabled

and enable it by loading the nv_rsync_mem.ko kernel module, which should have been installed by your sysadmin:
sudo systemctl enable nv_rsync_mem
sudo systemctl start nv_rsync_mem
nvidia-smi -i 0 -q|grep elax
Relaxed Ordering Mode : Enabled

nv_rsync_mem usually comes as a package that is part of the IBM Spectrum MPI distribution. To build and install it:
sudo rpmbuild --rebuild /opt/ibm/spectrum_mpi_gpusupport/nv_rsync_mem-1.0-2.src.rpm
sudo rpm -ihv /home/rpmbuild/RPMS/ppc64le/nv_rsync_mem-1.0-2.ppc64le.rpm

Note: the module must be loaded very early, right after nvidia.ko and before any other GPU clients start, so the easiest approach is to reboot after installing the package and rely on the systemd init script to enforce the right ordering.

I just noted that you report having nv_rsync_mem loaded.

  1. Still, nvidia-smi should show that relaxed ordering is enabled.
  2. Which MOFED version are you running, as in ‘ofed_info | head’?
  3. Which CX-5 SKU do you have? Please attach the output of ibv_devinfo.

There should be a single physical CX-5 card, which shows up with 4 specific BDFs:
lspci |grep ella
0003:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0003:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0033:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0033:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
ibv_devinfo |grep board_id
board_id: IBM0000000002
board_id: IBM0000000002
board_id: IBM0000000002
board_id: IBM0000000002

If you have a different setup, please contact the IBM support team.

Thank you Davide,

You were right: the ConnectX-5 relaxed ordering mode was disabled.
I enabled it with:
sudo mlxconfig -d mlx5_0 set PCI_WR_ORDERING=1
then rebooted.
nv_rsync_mem is loaded early at boot by the service script, right after nvidia-tesla:

nv_peer_mem             8513  0
nvidia_uvm           1107130  0
nvidia_modeset       1342589  0
nv_rsync_mem           16446  1
nvidia              21702034  31 nvidia_uvm,nv_peer_mem,nv_rsync_mem,nvidia_modeset

But once enabled, it does not change the BW…

IBMNPU
    Relaxed Ordering Mode       : Enabled
PCI
    Bus                         : 0x04
    Device                      : 0x00
    Domain                      : 0x0004
    Device Id                   : 0x1DB510DE

GPU0 and mlx5_0 are on the same POWER9 socket/PCIe domain.

Our ConnectX-5 hardware has a different ID than yours (I checked that the sniffer is off):
0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.24.8000
node_guid: 9803:9b03:0005:2816
sys_image_guid: 9803:9b03:0005:2816
vendor_id: 0x02c9
vendor_part_id: 4121
hw_ver: 0x0
board_id: IBM0000000020

We have MLNX_OFED_LINUX-4.7-3.2.9.0-ubuntu18.04 on a Debian system. No issues during installation.

ii  ibverbs-utils                               41mlnx1-OFED.4.7.0.0.2.47329          ppc64el      Examples for the libibverbs library
ii  libibverbs-dev                              41mlnx1-OFED.4.7.0.0.2.47329          ppc64el      Development files for the libibverbs library
ii  libibverbs1                                 41mlnx1-OFED.4.7.0.0.2.47329          ppc64el      Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libmlx5-1                                   41mlnx1-OFED.4.7.0.3.3.47329          ppc64el      Userspace driver for Mellanox ConnectX InfiniBand HCAs
ii  libmlx5-dev                                 41mlnx1-OFED.4.7.0.3.3.47329          ppc64el      Development files for the libmlx5 driver
ii  librdmacm-dev                               41mlnx1-OFED.4.7.3.0.6.47329          ppc64el      Development files for the librdmacm library
ii  librdmacm1                                  41mlnx1-OFED.4.7.3.0.6.47329          ppc64el      Userspace RDMA Connection Manager
ii  mlnx-ofed-kernel-dkms                       4.7-OFED.4.7.3.2.9.1.g457f064         all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                      4.7-OFED.4.7.3.2.9.1.g457f064         ppc64el      Userspace tools to restart and tune mlnx-ofed kernel modules

Linux scisoft15 4.19.0-6-powerpc64le #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) ppc64le GNU/Linux
nv_rsync_mem is from Spectrum MPI.

Something strange is reported by nvidia-smi for the Tesla: the PCIe link width is not x16:

IBMNPU
    Relaxed Ordering Mode       : Enabled
PCI
    Bus                         : 0x04
    Device                      : 0x00
    Domain                      : 0x0004
    Device Id                   : 0x1DB510DE
    Bus Id                      : 00000004:04:00.0
    Sub System Id               : 0x124910DE
    GPU Link Info
        PCIe Generation
            Max                 : 3
            Current             : 3
        Link Width
            Max                 : 16x
            Current             : 2x

lspci reports the same:
LnkSta: Speed 8GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

But that would not be consistent with the observed BW to GPU memory (off by roughly a factor of 2), nor with the near-line-rate BW to CPU memory (97 Gb/s)?

Overall, I see a disconnect in the software levels installed on your machine. Are you in contact with your IBM support team?

PCI_WR_ORDERING should not be enabled… it’s a totally different thing. Please switch it off.

You should have instead:
IBM_TUNNELED_ATOMIC_EN True(1)
IBM_AS_NOTIFY_EN True(1)
IBM_CAPI_EN True(1)

The relaxed ordering I mentioned is a different concept, related to loading nv_rsync_mem, and it only works for NICs plugged into a specific slot of the AC922 that carries 2 x8 links in a x16 connector (known as a bifurcated slot).
If you plugged the NIC into a different slot, relaxed ordering will not be enabled for it, hence performance will lag.
Note that:
IBMNPU
Relaxed Ordering Mode : Enabled
simply means that nv_rsync_mem has been loaded, not that the system has all the pieces in the right place.

  1. AFAIK you should be using MLNX_OFED_LINUX-4.5-2.2.9.0 or later from the POWER9 4.5 branch

  2. Why “Linux scisoft15 4.19.0-6-powerpc64le #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) ppc64le GNU/Linux”? AFAIK the only validated distro is RHEL 7.x with a custom kernel.

  3. GPU PCIe link x2 is correct, as the bulk of the traffic, including GPUDirect RDMA, runs over the CPU-GPU NVLink bus.

Of the 3 IBM settings, only IBM_TUNNELED_ATOMIC_EN was disabled. I enabled it and rebooted. Same BW result…
I guess nv_rsync_mem found the ConnectX-5:

[   11.901472] nv_rsync_mem:init_nic_list:327 - found pci_dev=000000006610ec8f 0000:01:00.0 0000:01:00.0 devfn=0 vendor=15b3 device=1019 dma_mask=ffffffffffffffff
[   11.901481] nv_rsync_mem:init_nic_list:327 - found pci_dev=00000000911fcd4e 0000:01:00.1 0000:01:00.1 devfn=1 vendor=15b3 device=1019 dma_mask=ffffffffffffffff

We are working remotely. Is it possible to tell from lspci whether the ConnectX-5 is in the right slot?

0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

HINTS (I don’t know if this could be the cause of the issue):

It is not clear whether the service /usr/lib/systemd/system/nv_rsync_mem.service loads nv_rsync_mem during the boot process before the other GPU clients.

lsmod shows nvidia_modeset and nvidia_drm before nv_rsync_mem:

nv_rsync_mem           16446  1
nvidia_drm             60328  0
nvidia_modeset       1342589  1 nvidia_drm
nvidia              21702034  29 nv_rsync_mem,nvidia_modeset

Same info in dmesg:

[    9.894061] nvidia 0035:03:00.0: Using 64-bit DMA iommu bypass
[   10.021445] ipmi-powernv ibm,opal:ipmi: Found new BMC (man_id: 0x00a741, prod_id: 0x424f, dev_id: 0x00)
[   10.264170] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  418.116.00  Thu Nov 14 18:39:40 UTC 2019
[   10.509044] [drm] [nvidia-drm] [GPU ID 0x00040400] Loading driver
[   10.509048] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0004:04:00.0 on minor 1
[   10.509125] [drm] [nvidia-drm] [GPU ID 0x00350300] Loading driver
[   10.509126] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0035:03:00.0 on minor 2
[   10.737155] audit: type=1400 audit(1587603837.632:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=2076 comm="apparmor_parser"
[   10.737161] audit: type=1400 audit(1587603837.632:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=2076 comm="apparmor_parser"
[   11.901462] nv_rsync_mem:init_nic_list:313 - searching for Mellanox ConnectX-5 PCIe device: venid=15b3 devid=1019
[   11.901472] nv_rsync_mem:init_nic_list:327 - found pci_dev=000000006610ec8f 0000:01:00.0 0000:01:00.0 devfn=0 vendor=15b3 device=1019 dma_mask=ffffffffffffffff

SOFTWARE STACK

OFED was downloaded from Mellanox.

The Debian ppc64le kernel/distro was installed by our sysadmin.

ii  linux-image-4.19.0-6-powerpc64le            4.19.67-2+deb10u2                     ppc64el      Linux 4.19 for Little-endian 64-bit PowerPC

ii  nvidia-tesla-418-alternative                418.116.00-3~bpo10+1                  ppc64el      allows the selection of NVIDIA as GLX provider (Tesla 418 version)
ii  nvidia-tesla-418-driver                     418.116.00-3~bpo10+1                  ppc64el      NVIDIA metapackage (Tesla 418 version)
ii  nvidia-tesla-418-driver-bin                 418.116.00-3~bpo10+1                  ppc64el      NVIDIA driver support binaries (Tesla 418 version)
ii  nvidia-tesla-418-driver-libs:ppc64el        418.116.00-3~bpo10+1                  ppc64el      NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) (Tesla 418 version)
ii  nvidia-tesla-418-egl-icd:ppc64el            418.116.00-3~bpo10+1                  ppc64el      NVIDIA EGL installable client driver (ICD)
ii  nvidia-tesla-418-kernel-dkms                418.116.00-3~bpo10+1                  ppc64el      NVIDIA binary kernel module DKMS source (Tesla 418 version)
ii  nvidia-tesla-418-kernel-support             418.116.00-3~bpo10+1                  ppc64el      NVIDIA binary kernel module support files (Tesla 418 version)
ii  nvidia-tesla-418-smi                        418.116.00-3~bpo10+1                  ppc64el      NVIDIA System Management Interface (Tesla 418 version)
ii  nvidia-tesla-418-vdpau-driver:ppc64el       418.116.00-3~bpo10+1                  ppc64el      Video Decode and Presentation API for Unix - NVIDIA driver (Tesla 418)
ii  nvidia-tesla-driver                         418.116.00-3~bpo10+1                  ppc64el      transition to nvidia-tesla-418-driver

Can you confirm that at the software level we do not need special programming tricks?
My code is based on standard libibverbs (a minimal sketch follows below):

  • ibv_reg_mr() on a cuMemAlloc() GPU buffer (without any special attribute),
  • a RoCE UC queue pair,
  • RDMA WRITE verbs.

I have the expected performance transferring to CPU memory.
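
For reference, a minimal sketch of that pattern (not the actual application: the UC QP creation, rkey/address exchange and RDMA WRITE posting are omitted, the first RDMA device is simply used, and the file name is illustrative). It assumes the CUDA driver API plus libibverbs, with nv_peer_mem loaded so that ibv_reg_mr() can pin the device pointer:

/* gpu_reg.c - register a cuMemAlloc() buffer for RDMA (link: -lcuda -libverbs) */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    const size_t len = 64 * 1024 * 1024;            /* 64 MiB GPU buffer */

    /* CUDA driver API: device 0, a context, and plain device memory */
    CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS ||
        cuMemAlloc(&dptr, len) != CUDA_SUCCESS) {
        fprintf(stderr, "CUDA init/alloc failed\n");
        return 1;
    }

    /* libibverbs: open the first device and register the GPU buffer */
    int n = 0;
    struct ibv_device **list = ibv_get_device_list(&n);
    if (!list || n == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }
    struct ibv_context *verbs = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(verbs);

    /* With nv_peer_mem loaded, ibv_reg_mr() accepts the raw device pointer,
       so incoming RDMA WRITEs land directly in GPU memory (GPUDirect RDMA). */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)dptr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr(GPU buffer)"); return 1; }
    printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* ... create the UC QP, exchange rkey/addr, post RDMA WRITEs ... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(verbs);
    ibv_free_device_list(list);
    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}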

Problem partly solved.

I guess we have to use unified memory to get full BW, but I am not sure it is still really using GPUDirect:

Using cudaMallocManaged() instead of cuMemAlloc() in my code, I now get 94 Gb/s throughput (it was 39 Gb/s previously with the standard GPU allocation).
The data are accessible both from a CPU pointer (not required in my use case) and from a GPU kernel.
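
For clarity, the change amounts to something like the sketch below (assuming the rest of the verbs setup stays identical; the reg_managed_buffer helper name is just illustrative). Whether the NIC then writes straight into GPU memory or the driver stages pages through host memory is exactly the open question:

/* Register a managed (unified memory) buffer instead of a cuMemAlloc() one. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *reg_managed_buffer(struct ibv_pd *pd, size_t len, void **out)
{
    void *buf = NULL;
    /* One pointer valid on both CPU and GPU, migrated on demand by the driver */
    if (cudaMallocManaged(&buf, len, cudaMemAttachGlobal) != cudaSuccess)
        return NULL;
    *out = buf;

    /* The NIC now registers driver-managed memory; the transfer may be staged
       or migrated rather than being a pure peer-to-peer write into the GPU. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}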

I suspect there is an intermediate step in CPU memory. Is this true?