Error sending RDMA READ when its size is larger than Ethernet MTU size

Hi all,
I am having some trouble running jobs on my RoCE mini cluster (ConnectX-6, MT4123). When I use ibv_post_send to issue an RDMA READ request larger than the Ethernet MTU (e.g., 2000 bytes), I get a "transport retry counter exceeded" error. It works fine if I manually split a large RDMA READ into multiple smaller RDMA READ requests (e.g., 1400 bytes each).

I wonder if it has something to do with my lossy RoCE acceleration settings. Here are the current settings:

Sending access register...

Field Name                        | Data    
===============================================
roce_adp_retrans_field_select     | 0x00000001
roce_tx_window_field_select       | 0x00000001
roce_slow_restart_field_select    | 0x00000001
roce_adp_retrans_en               | 0x00000001
roce_tx_window_en                 | 0x00000001
roce_slow_restart_en              | 0x00000001
===============================================

My devinfo:

CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.34.1002
        Hardware version: 0
        Node GUID: 0xb83fd20300972b68
        System image GUID: 0xb83fd20300972b68
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xba3fd2fffe972b68
                Link layer: Ethernet

Thanks.

Problem solved. It turned out that I made a mistake in my code. qp_attr.path_mtu should be set to IBV_MTU_1024 in my case, not IBV_MTU_4096.
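
For anyone hitting the same error, here is a rough sketch of why IBV_MTU_1024 fits here. It assumes a 1500-byte Ethernet MTU (the post does not state the exact value) and a header budget of 96 bytes per packet, which is an illustrative figure and not taken from the post. The helper names are hypothetical. With path_mtu set to 4096, a 2000-byte READ response is presumably sent as a single packet that does not fit the Ethernet MTU, so it never arrives and the requester retries until the counter is exhausted. In a real program, the active_mtu field returned by ibv_query_port is probably a better source for this value than a hand-computed one.

```c
#include <stddef.h>

/* Assumed per-packet header budget in bytes (IP + UDP + RoCE transport
 * headers + ICRC). Illustrative value only, not taken from the post. */
#define ASSUMED_ROCE_OVERHEAD 96u

/* Hypothetical helper: largest standard IB path MTU payload
 * (256..4096 bytes) that fits in the given Ethernet MTU, or 0 if none fits. */
static unsigned int roce_path_mtu_bytes(unsigned int eth_mtu)
{
    static const unsigned int sizes[] = { 4096, 2048, 1024, 512, 256 };
    unsigned int room;
    size_t i;

    if (eth_mtu <= ASSUMED_ROCE_OVERHEAD)
        return 0;
    room = eth_mtu - ASSUMED_ROCE_OVERHEAD;

    /* Sizes are sorted from largest to smallest, so the first fit wins. */
    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (sizes[i] <= room)
            return sizes[i];
    }
    return 0;
}

/* Hypothetical helper: number of packets a message is split into
 * for a given path MTU payload size (ceiling division). */
static unsigned int packets_for(unsigned int msg_bytes,
                                unsigned int path_mtu_bytes)
{
    return (msg_bytes + path_mtu_bytes - 1) / path_mtu_bytes;
}
```

Under these assumptions, roce_path_mtu_bytes(1500) yields 1024, which corresponds to IBV_MTU_1024, and a 2000-byte READ is then carried in packets_for(2000, 1024) = 2 response packets that each fit on the wire. With a 4096-byte path MTU the same READ would be one oversized packet.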
