Environment:
CPU: Intel(R) Xeon(R) 6960P
OS: Ubuntu 22.04
NIC PN: MCX75310AAS-NEAT
Driver: MLNX_OFED_LINUX-23.10-3.2.2.0
Firmware: 28.41.1000 (MT_0000000838)
Problem:
Bandwidth does not meet specification in a one-to-one ib_write_bw test. With DCQCN and PFC, the receiver NIC continuously sends PFC pause frames, which causes RDMA write bandwidth to drop to about 220 Gb/s. Performance is better after switching to RTT-CC with PFC, but it is still below expectation.
Performance result with DCQCN and PFC:
PFC statistics on the Ethernet switch (not from NVIDIA) with DCQCN and PFC:
Performance result with RTT-CC and PFC:
PFC statistics on the Ethernet switch (not from NVIDIA) with RTT-CC and PFC:
Hi,
Thanks for your questions.
The firmware and driver versions are a bit outdated; we recommend using the latest driver and firmware.
For RoCE QoS configuration, you can look at the link below:
To rule out a switch issue, you may try running a back-to-back test (the two NICs connected directly, with no switch in between) and check whether the degradation is still observed.
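In a back-to-back setup there are no switch-side PFC statistics to read, so the pause activity has to be checked on the hosts instead. The following is a minimal sketch, not an official tool: it parses two text snapshots of `ethtool -S <ifname>` output and reports which per-priority pause counters changed between them. The counter names (`rx_prioN_pause`, `tx_prioN_pause`, and their `_duration` variants) are an assumption based on typical mlx5 driver output and may differ by driver version; the function names are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "     tx_prio3_pause: 1234".
# Assumption: counters follow the typical mlx5 naming in `ethtool -S` output.
PAUSE_RE = re.compile(
    r"^[ \t]*((?:rx|tx)_prio\d+_pause(?:_duration)?):[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_pause_counters(ethtool_output: str) -> Dict[str, int]:
    """Extract per-priority pause counters from `ethtool -S` text."""
    return {name: int(value) for name, value in PAUSE_RE.findall(ethtool_output)}


def pause_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that changed between two snapshots."""
    delta = {}
    for name, value in after.items():
        diff = value - before.get(name, 0)
        if diff != 0:
            delta[name] = diff
    return delta
```

To use it, one could save the `ethtool -S` output on the receiver before and after an ib_write_bw run, pass both texts through `parse_pause_counters`, and compare them with `pause_delta`. A growing `tx_prioN_pause` value on the receiver would be consistent with the receiver NIC generating the pause frames described in the question.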
Troubleshooting performance degradation may require collecting much more information and a deeper understanding of the scenario. To proceed with this, a support case will be required.
The case can be opened by sending an email to enterprisesupport@nvidia.com, and it will be handled according to entitlement.
Best Regards,
Anatoly