I am using a ConnectX-5 Ex on kernel 5.19.0-45-generic. One side creates an RC QP and sends its QP info over TCP; the other side, on receiving this message, creates its own RC QP, transitions it to RTR and then to RTS, and responds over the same TCP connection to the initiator, which then moves its own QP to RTR and then RTS. This works most of the time, but occasionally ibv_modify_qp() returns a connection timeout while transitioning to RTR.
I have two dual-port ConnectX-5 Ex adapters, so four interfaces in total (mlx5_0 through mlx5_3). So far I have only observed this on mlx5_2 on one particular node, which could indicate an adapter or cable issue, but I was looking for more information to prove that. I set /sys/module/mlx5_core/parameters/debug_mask to 3 and found no mlx5_core messages in syslog. I also tried some of the mst tools: running mlxtrace requires a config file, which I don't have, and I tried wqdump by following the MFT documentation, but that didn't help much either. Any suggestions or pointers would be highly appreciated.
Welcome to the NVIDIA Developer forums!
Regarding possible cable or adapter link-integrity issues, the mlxlink tool within Mellanox Firmware Tools (MFT) can provide much more information: https://docs.nvidia.com/networking/display/MFTv4250/mlxlink+Utility
mlxlink -d mlx5_2 -m -e -c would be a good starting point. Reviewing the ‘Physical Counters and BER Info’ output may show Link Down Counter increments, Link Error Recovery Counter increments, or a high effective Bit Error Rate (greater than 15E-255), any of which could indicate a problem with either link integrity or link training.
If abnormal output appears there, as a first troubleshooting step we would recommend swapping in a known-good cable and testing again. If the issue persists, we would recommend trying a known-good adapter, if one is available.
mlxtrace is used to gather hardware events; however, both the configuration file required to parse its output and the event flow it contains are proprietary, so unfortunately we cannot provide them.
If you are still unable to resolve this condition after the above steps are taken, and you have a valid NVIDIA Enterprise Networking Support entitlement, we would recommend opening a support ticket so further assistance can be rendered: https://enterprise-support.nvidia.com/s/create-case
NVIDIA Enterprise Experience
Thanks @ssimcoejr for the reply. Is there an ethtool or other counter that can detect a CQ overrun scenario?