Hi. We’re trying to debug issues we see periodically with networking on top of CX-3 and CX-4 based (RoCEv1) fabrics using SR-IOV for connections from clients running as KVM guests (servers are bare-metal). When we hit these errors we see drop/error counters going up on the hosts.
So far all simple tests between host pairs look OK. Now we want to test congestion scenarios, e.g., 2 hosts sending to 1 host. We discovered that whilst e.g. ib_write_bw has an option to specify more than one QP, it actually doesn’t support it! Is there a simple way to engineer such a test, or are we going to have to write something or move to an MPI-based test suite?
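One way a 2-to-1 scenario like this is sometimes approximated, offered here only as a rough sketch, is to run one ib_write_bw server instance per sender on the receiving host, each listening on its own TCP port, and start the matching clients at about the same time. The Python helper below only builds the command lines; it does not run anything. The host names (hostA, hostB, hostC), the device name mlx5_0, and the function name are hypothetical placeholders. The flags used (`-d` device, `-p` port, `-D` duration in seconds) are standard perftest options, but a RoCE setup may need further options, such as a GID index, that are not shown.

```python
import shlex


def incast_commands(receiver, senders, device="mlx5_0", base_port=18515, seconds=30):
    """Build ib_write_bw command lines for an N-to-1 test.

    One server instance per sender runs on the receiver, each on its own port.
    All names and defaults here are illustrative assumptions, not site values.
    """
    server_cmds, client_cmds = [], []
    for i, sender in enumerate(senders):
        port = base_port + i  # distinct port so the server instances do not clash
        common = ["ib_write_bw", "-d", device, "-p", str(port), "-D", str(seconds)]
        server_cmds.append((receiver, common))           # server: no peer argument
        client_cmds.append((sender, common + [receiver]))  # client: connects to receiver
    return server_cmds, client_cmds


if __name__ == "__main__":
    servers, clients = incast_commands("hostC", ["hostA", "hostB"])
    for host, cmd in servers + clients:
        print(f"{host}: {shlex.join(cmd)}")
```

The server commands would be started on the receiver first, then the client commands on each sender. On CX-3 hosts the device name would typically be an mlx4 device rather than an mlx5 one.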
message_8578.pdf (446 KB)
Thank you for the comprehensive answer, however please note that this is a RoCE(v1) fabric, i.e., there is no IB link-layer, so almost all of the troubleshooting tips you provided do not apply (directly). I would dearly love to see the same sort of guide for RoCE.
To monitor and analyse your network, I would suggest a new open tool provided by Mellanox called NEO. It is a network management interface for analysing, monitoring and diagnosing your Ethernet network, and it also works for RoCE.
Let’s have a look at :