I'd like to measure GPU-to-GPU ib_read_bw / ib_write_bw bandwidth for the various PCIe-topology combinations within a single node.
However, the --use_cuda option does not work properly.
Could you help me with the next step needed to get this bandwidth measurement working?
- I would also like to know whether my test command is correct (the general form I am using is sketched right after this list).
- With the -d and --use_cuda options, can I test every GPU - HCA (server) - HCA - GPU (client) combination for data communication?
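For concreteness, this is the general form I am using (the port and device names are just examples, and I am assuming the --use_cuda index follows the GPU ordering reported by nvidia-smi):

# server: open HCA <hca> and place the test buffer on GPU <cuda_idx>
$ ./ib_read_bw -d <hca> --use_cuda=<cuda_idx> -p <port>

# client: connect to the server and use its own HCA / GPU pair
$ ./ib_read_bw <server_ip> -d <hca> --use_cuda=<cuda_idx> -p <port>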
Software Info
- OS : Ubuntu 20.04
- linux-rdma/perftest (tag: v4.5-0.2)
- NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2
- MLNX_OFED_LINUX-5.3-1.0.0.1 (OFED-5.3-1.0.0)
ib_read_bw
Server
$ ./ib_read_bw -d mlx5_1 --use_cuda=0 -p 50001
* Waiting for client to connect... *
initializing CUDA
Listing all CUDA devices in system:
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x04 QPN 0x1854 PSN 0xd2b3ee OUT 0x10 RKey 0x00277e VAddr 0x007faec3210000
remote address: LID 0x03 QPN 0x1000 PSN 0xdfe56e OUT 0x10 RKey 0x002467 VAddr 0x007f5643210000
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
Client
$ ./ib_read_bw 127.0.0.1 -d mlx5_2 --use_cuda=7 -p 50001
initializing CUDA
Listing all CUDA devices in system:
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x03 QPN 0x1000 PSN 0xdfe56e OUT 0x10 RKey 0x002467 VAddr 0x007f5643210000
remote address: LID 0x04 QPN 0x1854 PSN 0xd2b3ee OUT 0x10 RKey 0x00277e VAddr 0x007faec3210000
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
mlx5: ai004: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008914 10001000 0000b0d2
Completion with error at client
Failed status 11: wr_id 0 syndrom 0x89
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  CPU Affinity     NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     SYS     SYS     48-63,176-191    3
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     SYS     SYS     48-63,176-191    3
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     SYS     SYS     16-31,144-159    1
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     SYS     SYS     16-31,144-159    1
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS     SYS     SYS     SYS     SYS     PXB     112-127,240-255  7
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS     SYS     SYS     SYS     SYS     PXB     112-127,240-255  7
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS     SYS     PXB     SYS     SYS     SYS     80-95,208-223    5
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS     SYS     PXB     SYS     SYS     SYS     80-95,208-223    5
mlx5_0  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   X       SYS     SYS     SYS     SYS     SYS
mlx5_1  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS     X       SYS     SYS     SYS     SYS
mlx5_2  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS     SYS     X       SYS     SYS     SYS
mlx5_3  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS     SYS     SYS     X       PIX     SYS
mlx5_4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS     SYS     SYS     PIX     X       SYS
mlx5_5  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS     SYS     SYS     SYS     SYS     X
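If the single-pair command form above is correct, this is roughly the sweep I would like to script over the PXB-attached GPU/HCA pairs from the matrix above (loopback on one node; the GPU-to-HCA pairings, the port number, and the crude background/sleep handling are my own assumptions, not anything prescribed by perftest):

# GPU -> locally attached HCA (PXB in the matrix above):
#   GPU0/GPU1 -> mlx5_1, GPU2/GPU3 -> mlx5_0, GPU4/GPU5 -> mlx5_5, GPU6/GPU7 -> mlx5_2
pairs="0:mlx5_1 2:mlx5_0 4:mlx5_5 6:mlx5_2"
for src in $pairs; do
  for dst in $pairs; do
    [ "$src" = "$dst" ] && continue
    sgpu=${src%%:*}; shca=${src##*:}    # server-side GPU index and HCA
    dgpu=${dst%%:*}; dhca=${dst##*:}    # client-side GPU index and HCA
    ./ib_read_bw -d "$shca" --use_cuda="$sgpu" -p 50001 &            # server
    sleep 2                                                          # give the server time to listen
    ./ib_read_bw 127.0.0.1 -d "$dhca" --use_cuda="$dgpu" -p 50001    # client
    wait                                                             # reap the server before the next pair
  done
done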