Some NICs not reaching their available bandwidth (MT28908 [ConnectX-6])

Hey!

So we’re facing an annoying problem.

We have multiple servers with identical components and identical drivers/software/settings.
However, some NICs are not achieving their expected bandwidth, even though everything is exactly the same.
Even the BIOS settings are identical.

Server 1, for example:
[ 14.962184] mlx5_core 0000:c4:00.0: firmware version: 20.32.1010
[ 14.962244] mlx5_core 0000:c4:00.0: 252.048 Gb/s available PCIe bandwidth (16 GT/s x16 link)
[ 14.970402] mlx5_core 0000:c4:00.0: handle_hca_cap:692:(pid 787): log_max_qp value in current profile is 18, changing it to HCA capability limit (17)
[ 15.126171] mlx5_core 0000:c4:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 15.144644] mlx5_core 0000:c4:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)

While on server 2:
[ 14.309102] mlx5_core 0000:c4:00.0: firmware version: 20.32.1010
[ 14.309133] mlx5_core 0000:c4:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:c0:03.1 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 14.316864] mlx5_core 0000:c4:00.0: handle_hca_cap:692:(pid 995): log_max_qp value in current profile is 18, changing it to HCA capability limit (17)
[ 14.529035] mlx5_core 0000:c4:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 14.529944] mlx5_core 0000:c4:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)

As you can see, server 2 is clearly not reaching its full bandwidth: the link comes up at x4 instead of x16.
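For reference, the negotiated link can also be read directly from sysfs instead of grepping dmesg. A minimal sketch, assuming the 0000:c4:00.0 address from the logs above (adjust for your system):

#!/usr/bin/env python3
# Compare the negotiated PCIe link of the ConnectX-6 port against its capability,
# using the sysfs attributes exposed for every PCI device.
from pathlib import Path

bdf = "0000:c4:00.0"  # PCI address taken from the dmesg lines above; adjust if needed
dev = Path("/sys/bus/pci/devices") / bdf

def read(attr: str) -> str:
    return (dev / attr).read_text().strip()

cur_speed, cur_width = read("current_link_speed"), read("current_link_width")
max_speed, max_width = read("max_link_speed"), read("max_link_width")

print(f"{bdf}: running {cur_speed} x{cur_width} (capable of {max_speed} x{max_width})")
if (cur_speed, cur_width) != (max_speed, max_width):
    print("link is downgraded: check seating, risers and BIOS slot/bifurcation settings")

The same information is available from sudo lspci -s c4:00.0 -vv in the LnkSta/LnkCap lines.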

The only thing I can think of is that some of the BIOS settings set to “auto” might end up configured differently between servers.
I’ve tried numerous settings and always end up with the same result.

Could use some pointers on where to look :)

Thank you in advance!

Hello elio.vp,

Thank you for posting your inquiry on the NVIDIA Developer Forum - Infrastructure and Networking section.

Based on the information provided, please make sure you do not leave any BIOS settings on auto. Auto settings do not always guarantee the correct outcome. Make sure all BIOS versions and settings are completely identical.

If you are certain that this is all in place, then swapping the adapters around would be the next step before submitting an RMA for the adapter.

In the node that is working as expected, swap out the adapter for one of the known bad-performing ones and check whether the issue follows the adapter. If the issue follows the adapter, we recommend issuing an RMA. If the issue stays with the node, we recommend contacting the system vendor. Please make sure the system vendor has certified the adapter.

Thank you and regards,
~NVIDIA Networking Technical Support

Hey MvB and thank you for your reply.

It’s not the NIC that needs to be RMA’d.

I’m 100% sure it’s a BIOS setting (or multiple settings in combination).
That’s what I was trying to ask about.

But it’s fine, we will figure it out.

Greetings

Elio

Hello elio.vp,

There is still a possibility that the adapter is at fault. The provided triage steps will determine whether it is the system node or the adapter.

If it is the adapter, and you still have warranty or a valid support contract on it, please do not hesitate to open an RMA request → https://support.mellanox.com/s/public-rma

Thank you and regards,
~NVIDIA Networking Technical Support

NICs were not seated properly

For anyone else who might run into the same issue:
check that the NIC is seated firmly in the PCIe slot.

We had several servers where this was not the case, and the system downgraded the link to x8, x4, or even x2.
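If it helps, here is a small sketch of the check we could have used to spot this across servers. It assumes the cards are bound to the mlx5_core driver and relies on the standard PCI sysfs attributes:

#!/usr/bin/env python3
# Flag any mlx5_core-bound device whose negotiated PCIe link width
# is below the card's capability (e.g. x4 or x2 instead of x16).
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/drivers/mlx5_core").glob("0000:*")):
    cur = (dev / "current_link_width").read_text().strip()
    cap = (dev / "max_link_width").read_text().strip()
    state = "OK" if cur == cap else "DOWNGRADED - reseat the card"
    print(f"{dev.name}: x{cur} of x{cap} -> {state}")

Running this on every node quickly shows which slots negotiated a narrower link than the card supports.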

Cheers
