GPU: 8x H100-80G-SXM with 4x NVSwitch
Device Type: ConnectX7
Description: NVIDIA ConnectX-7 Single Port Infiniband NDR OSFP Adapter
Versions:        Current         Available
     FW          28.41.1000      N/A
     PXE         3.7.0400        N/A
     UEFI        14.34.0012      N/A
Software Stacks:
Datacenter driver: 570.133.20 (open kernel modules)
CUDA toolkit: 12.8
MOFED: MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64
Ubuntu: 22.04.3 on bare metal
kernel: 5.15.0-139-generic
gdrcopy: 2.5 (release)
ucx: 1.18.1
perftest: 25.01.0 (release)
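A quick way to cross-check the stack listed above on the host (a sketch, assuming MLNX_OFED's ofed_info and UCX's ucx_info are on the PATH):
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # datacenter driver
$ nvcc --version | tail -n 1                                    # CUDA toolkit
$ ofed_info -s                                                  # MLNX_OFED
$ uname -r                                                      # kernel
$ ucx_info -v | head -n 1                                       # UCX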
- ACS has been disabled at the OS level:
sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
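How ACS was turned off is not shown above; one common approach (a sketch only, it clears the ACS Control register on every port that advertises the capability, which is broader than strictly necessary) uses setpci:
$ for bdf in $(lspci -D | awk '{print $1}'); do sudo setpci -v -s "$bdf" ECAP_ACS+0x6.w=0000 2>/dev/null; done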
- ATS has been disabled on all 8x CX-7:
for i in $(seq 0 7); do sudo mlxconfig -d mlx5_$i query | grep ATS_ENABLED; done
ATS_ENABLED False(0)
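For reference, the setting itself would have been applied with something along these lines (a sketch; the new value only takes effect after a firmware reset or reboot):
$ for i in $(seq 0 7); do sudo mlxconfig -y -d mlx5_$i set ATS_ENABLED=0; done   # then reboot (or mlxfwreset per adapter) to apply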
Sanity Tests:
- GDRcopy has passed (a gdrcopy_copybw spot-check sketch follows the waived-tests list below):
$ gdrcopy_sanity
Total: 36, Passed: 31, Failed: 0, Waived: 5
List of waived tests:
basic_v2_forcepci_cumemalloc
basic_v2_forcepci_vmmalloc
basic_with_tokens
data_validation_mix_mappings_cumemalloc
data_validation_v2_forcepci_cumemalloc
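Beyond gdrcopy_sanity, a bandwidth spot-check with gdrcopy_copybw (also shipped with the gdrcopy 2.5 release) can confirm the BAR1 mapping path; pinning GPU 0 via CUDA_VISIBLE_DEVICES is an assumption, adjust to your topology:
$ CUDA_VISIBLE_DEVICES=0 gdrcopy_copybw   # host<->GPU copy bandwidth through the gdrcopy mapping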
- perftest without CUDA works; got the expected 396 Gb/s at all message sizes.
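For reference, the host-memory baseline can be reproduced with an invocation along these lines (a sketch; flags mirror the CUDA run below minus the CUDA options, and <server_host> is a placeholder):
$ ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0                  # server side
$ ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0 <server_host>    # client side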
**3. Got a segfault in perftest with CUDA in both cases:
- with --use_cuda_dmabuf
- without it (i.e., using nvidia_peermem)**
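For the non-dmabuf case, it is worth double-checking that the legacy peer-memory module is actually loaded before the run (a quick check, not part of the original log):
$ lsmod | grep nvidia_peermem || sudo modprobe nvidia_peermem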
$ gdb -q --args ./ib_send_bw -a --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Reading symbols from ./ib_send_bw...
(gdb) run
Starting program: /home/vmware/perftest-25.01.0/ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
* Waiting for client to connect... *
initializing CUDA
[New Thread 0x7ffff359d640 (LWP 81912)]
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00
Picking device No. 0
[pid = 81871, dev = 0] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
[New Thread 0x7ffde1339640 (LWP 81939)]
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 00007ffdb2000000 pointer=0x7ffdb2000000
using DMA-BUF for GPU buffer address at 0x7ffdb2000000 aligned at 0x7ffdb2000000 with aligned size 67108864
Calling ibv_reg_dmabuf_mr(offset=0, size=67108864, addr=0x7ffdb2000000, fd=67) for QP #0
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x09 QPN 0x0067 PSN 0xbd0ea
local address: LID 0x09 QPN 0x0068 PSN 0x39998c
local address: LID 0x09 QPN 0x0069 PSN 0xfba9e6
local address: LID 0x09 QPN 0x006a PSN 0xfa9a3d
remote address: LID 0x08 QPN 0x0071 PSN 0x92add
remote address: LID 0x08 QPN 0x0072 PSN 0xed3803
remote address: LID 0x08 QPN 0x0073 PSN 0x890c91
remote address: LID 0x08 QPN 0x0074 PSN 0x773f0c
bytes iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Thread 1 "ib_send_bw" received signal SIGSEGV, Segmentation fault.
__memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
373 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
#1 0x00007ffff7e82b06 in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#2 0x00007ffff7e5c34e in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
#4 run_iter_bw_server (ctx=ctx@entry=0x7fffffffcd80, user_param=user_param@entry=0x7fffffffcfc0) at src/perftest_resources.c:3832
#5 0x000055555555c4e3 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:458
(gdb) frame 3
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
2927 return cq->context->ops.poll_cq(cq, num_entries, wc);
GDB shows the crash occurs inside libmlx5.so.1, so I suspect the root cause lies in the MLNX_OFED stack rather than in perftest itself.
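To narrow this down further, it may help to confirm which package supplies the crashing provider and which OFED build it came from (a sketch; the owning package differs between MLNX_OFED and the inbox rdma-core):
$ dpkg -S /lib/x86_64-linux-gnu/libmlx5.so.1
$ ofed_info -s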
Could you suggest a specific, known-good combination of datacenter driver, MLNX_OFED / DOCA-OFED, CUDA, NCCL, gdrcopy, and perftest versions that makes GPUDirect RDMA (GDR) validation work on bare metal?