Hi all,
I am having trouble running jobs on my small RoCE cluster (ConnectX-6, MT4123). When I use ibv_post_send
to issue an RDMA READ request larger than the Ethernet MTU (e.g., 2000 bytes), it fails with "transport retry counter exceeded".
It works fine if I manually split a large RDMA READ into multiple smaller RDMA READ requests (e.g., 1400 bytes each).
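A minimal sketch of the kind of split described above, not the actual application code. The names `read_chunk` and `plan_read_chunks` are hypothetical, and the sketch only computes the offset and length of each piece. It assumes each piece would then be posted as its own work request with opcode IBV_WR_RDMA_READ, using the piece's local address in the scatter/gather entry and its remote address in `wr.rdma.remote_addr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one piece of a split RDMA READ. */
struct read_chunk {
    uint64_t local_addr;   /* where this piece lands in the local buffer */
    uint64_t remote_addr;  /* where this piece is read from on the peer */
    uint32_t length;       /* bytes in this piece, at most max_chunk */
};

/*
 * Split a READ of total_len bytes into pieces of at most max_chunk bytes.
 * Fills up to cap descriptors in out and returns the number of pieces
 * needed, which can be larger than cap. Returns 0 if max_chunk is 0.
 */
size_t plan_read_chunks(uint64_t local_addr, uint64_t remote_addr,
                        size_t total_len, uint32_t max_chunk,
                        struct read_chunk *out, size_t cap)
{
    size_t count = 0;
    size_t offset = 0;

    assert(out != NULL || cap == 0);
    if (max_chunk == 0)
        return 0;

    while (offset < total_len) {
        size_t remaining = total_len - offset;
        uint32_t len = remaining < max_chunk ? (uint32_t)remaining : max_chunk;

        if (count < cap) {
            /* Local and remote addresses advance by the same offset. */
            out[count].local_addr = local_addr + offset;
            out[count].remote_addr = remote_addr + offset;
            out[count].length = len;
        }
        count++;
        offset += len;
    }
    return count;
}
```

With total_len = 2000 and max_chunk = 1400, this yields two pieces of 1400 and 600 bytes.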
I wonder if it has something to do with my lossy RoCE acceleration settings. Here they are:
Sending access register...
Field Name | Data
===============================================
roce_adp_retrans_field_select | 0x00000001
roce_tx_window_field_select | 0x00000001
roce_slow_restart_field_select | 0x00000001
roce_adp_retrans_en | 0x00000001
roce_tx_window_en | 0x00000001
roce_slow_restart_en | 0x00000001
===============================================
My devinfo:
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.34.1002
    Hardware version: 0
    Node GUID: 0xb83fd20300972b68
    System image GUID: 0xb83fd20300972b68
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 25
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xba3fd2fffe972b68
        Link layer: Ethernet
Thanks.