The bandwidth measured by my test code does not match the bandwidth measured by perftest

To learn RDMA, I found an example on the Internet that is similar to the one provided by MELLANOX, but when I ran it on two machines I hit the following problems:

1. There is a large gap between the bandwidth measured by the example code and the bandwidth measured by perftest.

2. In addition, using GID index 0 or 2 on either of the two machines significantly reduces the bandwidth (see the sketch below for where the GID index is applied).
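For context, the GID index is applied when moving the RC QP to RTR; a simplified sketch of that step is shown below. Names such as remote_qpn, remote_psn, remote_gid and gid_index are placeholders for values exchanged out of band, not identifiers from the original example.

/* Minimal sketch of where the GID index matters: the RTR transition.
 * remote_qpn, remote_psn and remote_gid are assumed to have been
 * exchanged out of band; names are illustrative only. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int qp_to_rtr(struct ibv_qp *qp, uint8_t port, int gid_index,
                     uint32_t remote_qpn, uint32_t remote_psn,
                     union ibv_gid remote_gid, enum ibv_mtu mtu)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = mtu;            /* should not exceed active_mtu */
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 16;             /* allow several outstanding reads */
    attr.min_rnr_timer      = 12;

    /* RoCE: the LID stays 0 and routing uses the GRH, so the local GID index
     * (and therefore which entry of the GID table above is used) is set here. */
    attr.ah_attr.is_global      = 1;
    attr.ah_attr.port_num       = port;
    attr.ah_attr.grh.dgid       = remote_gid;
    attr.ah_attr.grh.sgid_index = (uint8_t)gid_index;  /* 0..3 in the tables below */
    attr.ah_attr.grh.hop_limit  = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}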

Machine A:

Configuration:

hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.39.3004
        node_guid:                      1070:fd03:00e5:f118
        sys_image_guid:                 1070:fd03:00e5:f118
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000224
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_bond_0     1       0       fe80:0000:0000:0000:b0fc:4eff:feb3:1112                 v1      bond0
mlx5_bond_0     1       1       fe80:0000:0000:0000:b0fc:4eff:feb3:1112                 v2      bond0
mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:0a77:2e3d 10.119.46.61    v1      bond0
mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:0a77:2e3d 10.119.46.61    v2      bond0

Test with perftest on GID index 1:

---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
RX depth:               1
post_list:              1
inline_size:            0
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x1659 PSN 0xd4858a OUT 0x10 RKey 0x203e00 VAddr 0x007f38d0d07000
 GID: 254:128:00:00:00:00:00:00:176:252:78:255:254:179:17:18
 remote address: LID 0000 QPN 0x1c86 PSN 0xc2e51a OUT 0x10 RKey 0x013f00 VAddr 0x007f123fc62000
 GID: 254:128:00:00:00:00:00:00:100:155:154:255:254:172:09:41
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1000             10829.53            10829.17       0.173267
---------------------------------------------------------------------------------------

Machine B:

hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.39.3004
        node_guid:                      e8eb:d303:0032:b212
        sys_image_guid:                 e8eb:d303:0032:b212
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000224
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

DEV     PORT    INDEX   GID                                     IPv4              VER     DEV
---     ----    -----   ---                                     ------------      ---     ---
mlx5_bond_0     1       0       fe80:0000:0000:0000:649b:9aff:feac:0929                   v1      bond0
mlx5_bond_0     1       1       fe80:0000:0000:0000:649b:9aff:feac:0929                   v2      bond0
mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:0a77:2e3e   10.119.46.62    v1      bond0
mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:0a77:2e3e   10.119.46.62    v2      bond0
n_gids_found=4

Test with perftest on GID index 0:

                    RDMA_Read BW Test
RX depth:               1
post_list:              1
inline_size:            0
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 1
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x1659 PSN 0xd4858a OUT 0x10 RKey 0x203e00 VAddr 0x007f38d0d07000
 GID: 254:128:00:00:00:00:00:00:176:252:78:255:254:179:17:18
 remote address: LID 0000 QPN 0x1c86 PSN 0xc2e51a OUT 0x10 RKey 0x013f00 VAddr 0x007f123fc62000
 GID: 254:128:00:00:00:00:00:00:100:155:154:255:254:172:09:41
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      1000             10829.53            10829.17       0.173267
---------------------------------------------------------------------------------------

When I run the example code, the bandwidth is about 0.0124 GB/s when one machine uses GID 0 and the other uses GID 0 or GID 1, and about 6 GB/s when both machines use GID 1. I’d like to know what optimizations the perftest code makes, or what deficiencies in the example code cause such a large difference in the measured bandwidth.
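For comparison, my understanding is that perftest keeps many work requests outstanding on the send queue and reaps completions in batches, while a simple example usually posts one request and blocks on its completion before posting the next. Below is a rough sketch of that pipelined pattern; the function and variable names are mine, not taken from perftest or the example, and error handling is minimal.

/* Sketch of a pipelined RDMA Read loop: keep up to `depth` reads in flight
 * and poll the CQ in batches, instead of one post + one blocking poll at a
 * time. qp, cq, local_mr, remote_addr, rkey and msg_size are assumed to be
 * set up elsewhere. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

static int run_pipelined_reads(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *local_mr, uint64_t remote_addr,
                               uint32_t rkey, size_t msg_size,
                               long depth, long iterations)
{
    long posted = 0, completed = 0;
    struct ibv_wc wc[16];

    while (completed < iterations) {
        /* Keep the send queue full: post until `depth` WRs are outstanding. */
        while (posted < iterations && posted - completed < depth) {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)local_mr->addr,
                .length = (uint32_t)msg_size,
                .lkey   = local_mr->lkey,
            };
            struct ibv_send_wr wr = {
                .wr_id      = (uint64_t)posted,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_READ,
                .send_flags = IBV_SEND_SIGNALED,
                .wr.rdma.remote_addr = remote_addr,
                .wr.rdma.rkey        = rkey,
            };
            struct ibv_send_wr *bad;
            if (ibv_post_send(qp, &wr, &bad))
                return -1;
            posted++;
        }

        /* Reap completions in batches rather than one at a time. */
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        for (int i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS)
                return -1;
            completed++;
        }
    }
    return 0;
}

For RDMA Read in particular, the useful depth on the wire is also bounded by the negotiated number of outstanding reads (shown as "Outstand reads: 16" in the perftest output above), so I would not expect a depth much beyond that to help for this opcode.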