My team is currently standing up a new cluster that has an SN2700 core Ethernet switch on our boot network. LAG links are working fine between this core and the leaf switches in the new cluster. We also have an older cluster with an SX1036 Ethernet switch serving as its core switch. LAG links are also working fine between this older core switch and the older leaf switches in that cluster. Several of us have tried to get LAG working between the SX1036 and the SN2700, and we can’t get a working link (a single link works fine). We’ve done the typical troubleshooting, looking for bad cables, bad ports, etc. Comparing the configuration and status of the working LAG links against the failing link, we can find no differences.
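In case it helps anyone repeat the comparison, here is a minimal sketch of one way to diff two saved text captures (for example, the status output for a working LAG and for the failing one, pasted into strings or read from files). The function name `diff_captures` and the sample `key: value` lines are made up for illustration and are not real MLNX-OS output; the only library used is Python's standard `difflib`.

```python
import difflib


def diff_captures(working: str, failing: str) -> list[str]:
    """Return only the lines that differ between two text captures.

    Lines removed from the working capture are prefixed with "- ",
    lines added in the failing capture are prefixed with "+ ".
    Blank lines and leading/trailing whitespace are ignored.
    """
    a = [line.strip() for line in working.splitlines() if line.strip()]
    b = [line.strip() for line in failing.splitlines() if line.strip()]
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [d.rstrip("\n") for d in difflib.ndiff(a, b)
            if d.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical sample data, NOT actual switch output.
    working = "mode: active\nmtu: 9216\nspeed: 40G\n"
    failing = "mode: active\nmtu: 1500\nspeed: 40G\n"
    for line in diff_captures(working, failing):
        print(line)
```

An empty result only means the two captures are textually identical after whitespace trimming; it says nothing about state that the captured commands do not show.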
The SX1036 is a PPC switch and is running much older firmware:
Product name: MLNX-OS
Product release: 3.4.3002
Build ID: #1-dev
Build date: 2015-07-30 20:13:15
Target arch: ppc
Target hw: m460ex
Built by: jenkins@fit74
Version summary: PPC_M460EX 3.4.3002 2015-07-30 20:13:15 ppc
Product model: ppc
than the SN2700 (X86):
Product name: MLNX-OS
Product release: 3.6.3200
Build ID: #1-dev
Build date: 2017-03-09 17:55:58
Target arch: x86_64
Target hw: x86_64
Built by: jenkins@e3f42965d5ee
Version summary: X86_64 3.6.3200 2017-03-09 17:55:58 x86_64
Product model: x86onie
The obvious thing to try is updating the firmware on the SX1036, but this cluster is in production and our team is nervous about messing with that core switch, as it’s pretty critical to our infrastructure. Would a firmware mismatch cause this behavior?
I have seen documentation indicating that MLAG doesn’t work between PPC and X86 switches. I sure hope that’s not the case for LAG…