GPU: 8x H100-80G-SXM with 4x NVSwitch
Device Type: ConnectX7
Description: NVIDIA ConnectX-7 Single Port Infiniband NDR OSFP Adapter
Versions:        Current         Available
     FW          28.41.1000      N/A
     PXE         3.7.0400        N/A
     UEFI        14.34.0012      N/A
Software Stacks:
Datacenter driver: 570.133.20 (open kernel modules)
CUDA toolkit: 12.8
MOFED: MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64
Ubuntu: 22.04.3 on bare metal
kernel: 5.15.0-139-generic
gdrcopy: 2.5 (release)
ucx: 1.18.1
perftest: 25.01.0 (release)
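A quick way to cross-check the stack listed above on the host (a sketch, assuming MLNX_OFED's ofed_info and UCX's ucx_info are on the PATH):
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # datacenter driver
$ nvcc --version | tail -n 1                                    # CUDA toolkit
$ ofed_info -s                                                  # MLNX_OFED
$ uname -r                                                      # kernel
$ ucx_info -v | head -n 1                                       # UCX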
- ACS has been disabled at the OS level:
sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
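How ACS was turned off is not shown above; one common approach (a sketch only, it clears the ACS Control register on every port that advertises the capability, which is broader than strictly necessary) uses setpci:
$ for bdf in $(lspci -D | awk '{print $1}'); do sudo setpci -v -s "$bdf" ECAP_ACS+0x6.w=0000 2>/dev/null; done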
- ATS has been disabled on all 8x CX-7:
for i in $(seq 0 7); do sudo mlxconfig -d mlx5_$i query | grep ATS_ENABLED; done
ATS_ENABLED False(0)
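For reference, the setting itself would have been applied with something along these lines (a sketch; the new value only takes effect after a firmware reset or reboot):
$ for i in $(seq 0 7); do sudo mlxconfig -y -d mlx5_$i set ATS_ENABLED=0; done   # then reboot (or mlxfwreset per adapter) to apply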
Sanity Tests:
- GDRcopy has passed (a gdrcopy_copybw spot-check sketch follows the waived-tests list below):
$ gdrcopy_sanity
Total: 36, Passed: 31, Failed: 0, Waived: 5
List of waived tests:
basic_v2_forcepci_cumemalloc
basic_v2_forcepci_vmmalloc
basic_with_tokens
data_validation_mix_mappings_cumemalloc
data_validation_v2_forcepci_cumemalloc
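Beyond gdrcopy_sanity, a bandwidth spot-check with gdrcopy_copybw (also shipped with the gdrcopy 2.5 release) can confirm the BAR1 mapping path; pinning GPU 0 via CUDA_VISIBLE_DEVICES is an assumption, adjust to your topology:
$ CUDA_VISIBLE_DEVICES=0 gdrcopy_copybw   # host<->GPU copy bandwidth through the gdrcopy mapping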
- perftest without CUDA works; got the expected 396 Gb/s at all message sizes.
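For reference, the host-memory baseline can be reproduced with an invocation along these lines (a sketch; flags mirror the CUDA run below minus the CUDA options, and <server_host> is a placeholder):
$ ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0                  # server side
$ ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0 <server_host>    # client side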
**3. Got a segfault in perftest with CUDA in both cases:
- with --use_cuda_dmabuf
- without it (i.e., using nvidia_peermem)**
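For the non-dmabuf case, it is worth double-checking that the legacy peer-memory module is actually loaded before the run (a quick check, not part of the original log):
$ lsmod | grep nvidia_peermem || sudo modprobe nvidia_peermem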
$ gdb -q --args ./ib_send_bw -a --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Reading symbols from ./ib_send_bw...
(gdb) run
Starting program: /home/vmware/perftest-25.01.0/ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
* Waiting for client to connect... *
initializing CUDA
[New Thread 0x7ffff359d640 (LWP 81912)]
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00
Picking device No. 0
[pid = 81871, dev = 0] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
[New Thread 0x7ffde1339640 (LWP 81939)]
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 00007ffdb2000000 pointer=0x7ffdb2000000
using DMA-BUF for GPU buffer address at 0x7ffdb2000000 aligned at 0x7ffdb2000000 with aligned size 67108864
Calling ibv_reg_dmabuf_mr(offset=0, size=67108864, addr=0x7ffdb2000000, fd=67) for QP #0
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x09 QPN 0x0067 PSN 0xbd0ea
local address: LID 0x09 QPN 0x0068 PSN 0x39998c
local address: LID 0x09 QPN 0x0069 PSN 0xfba9e6
local address: LID 0x09 QPN 0x006a PSN 0xfa9a3d
remote address: LID 0x08 QPN 0x0071 PSN 0x92add
remote address: LID 0x08 QPN 0x0072 PSN 0xed3803
remote address: LID 0x08 QPN 0x0073 PSN 0x890c91
remote address: LID 0x08 QPN 0x0074 PSN 0x773f0c
bytes iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Thread 1 "ib_send_bw" received signal SIGSEGV, Segmentation fault.
__memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
373 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
#1 0x00007ffff7e82b06 in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#2 0x00007ffff7e5c34e in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
#4 run_iter_bw_server (ctx=ctx@entry=0x7fffffffcd80, user_param=user_param@entry=0x7fffffffcfc0) at src/perftest_resources.c:3832
#5 0x000055555555c4e3 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:458
(gdb) frame 3
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
2927 return cq->context->ops.poll_cq(cq, num_entries, wc);
GDB shows the crash occurs inside libmlx5.so.1, so I suspect the root cause lies in the MLNX_OFED stack rather than in perftest itself.
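To narrow this down further, it may help to confirm which package supplies the crashing provider and which OFED build it came from (a sketch; the owning package differs between MLNX_OFED and the inbox rdma-core):
$ dpkg -S /lib/x86_64-linux-gnu/libmlx5.so.1
$ ofed_info -s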
Could you suggest a specific, known-good combination of datacenter driver, MLNX_OFED / DOCA-OFED, CUDA, NCCL, gdrcopy, and perftest versions that makes GPUDirect RDMA (GDR) validation work on bare metal?