LAG problems

My team is currently standing up a new cluster that has an SN2700 core Ethernet switch on our boot network. LAG links are working fine between this core and the leaf switches in the new cluster. We also have an older cluster with an SX1036 Ethernet switch serving as its core switch. LAG links are also working fine between this older core switch and the older leaf switches in that cluster. Several of us have tried to get LAG working between the SX1036 and the SN2700, and we can’t get a working link (a single link works fine). We’ve done the typical troubleshooting, looking for bad cables, ports, etc. We can find no differences when comparing the configurations and status of the working LAG links and the failing link.
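One way to make that configuration comparison less error-prone is to diff the text dumps mechanically. The following is a minimal Python sketch, not anything provided by MLNX-OS: the function name `config_diff` and the sample config lines are hypothetical placeholders, and it assumes the relevant config sections for a working and a failing LAG have been pasted into two strings.

```python
import difflib


def config_diff(working: str, failing: str) -> list[str]:
    """Return only the lines that differ between two config dumps.

    Leading/trailing whitespace and blank lines are ignored so that
    indentation differences do not show up as noise.
    """
    a = [line.strip() for line in working.splitlines() if line.strip()]
    b = [line.strip() for line in failing.splitlines() if line.strip()]
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged ("  ") and hint ("? ") lines are dropped.
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]


# Hypothetical placeholder snippets, NOT real output from either switch.
working_lag = """
interface port-channel 1
  switchport mode trunk
  mtu 9216
"""
failing_lag = """
interface port-channel 1
  switchport mode trunk
  mtu 1500
"""

for line in config_diff(working_lag, failing_lag):
    print(line)
```

An empty result means the two dumps match line for line after whitespace is normalized, which is the situation described above; in that case the difference is more likely to be in runtime status or in something outside the sections that were compared.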

The SX1036 is a PPC switch and is running a much older firmware:

Product name: MLNX-OS

Product release: 3.4.3002

Build ID: #1-dev

Build date: 2015-07-30 20:13:15

Target arch: ppc

Target hw: m460ex

Built by: jenkins@fit74

Version summary: PPC_M460EX 3.4.3002 2015-07-30 20:13:15 ppc

Product model: ppc

than the SN2700 (X86):

Product name: MLNX-OS

Product release: 3.6.3200

Build ID: #1-dev

Build date: 2017-03-09 17:55:58

Target arch: x86_64

Target hw: x86_64

Built by: jenkins@e3f42965d5ee

Version summary: X86_64 3.6.3200 2017-03-09 17:55:58 x86_64

Product model: x86onie
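To put the two versions side by side, a small script can parse the output above into fields and compare the releases. This is only a sketch: the helper names `parse_version_output` and `release_tuple` are made up for illustration, and it assumes the `Key: value` layout shown above with a purely numeric, dot-separated release string.

```python
def parse_version_output(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def release_tuple(release: str) -> tuple[int, ...]:
    """Turn a release such as '3.4.3002' into (3, 4, 3002) for ordering."""
    return tuple(int(part) for part in release.split("."))


# Fields copied from the two outputs above.
sx = parse_version_output("""
Product release: 3.4.3002
Build date: 2015-07-30 20:13:15
Target arch: ppc
""")
sn = parse_version_output("""
Product release: 3.6.3200
Build date: 2017-03-09 17:55:58
Target arch: x86_64
""")

older = release_tuple(sx["Product release"]) < release_tuple(sn["Product release"])
print("SX1036 release is older:", older)
print("Same CPU architecture:", sx["Target arch"] == sn["Target arch"])
```

Comparing tuples of integers avoids the pitfalls of comparing release strings directly, where "3.10" would sort before "3.4".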

The obvious thing to try is updating the firmware on the SX1036, but this cluster is in production and our team is nervous about messing with that core switch, as it’s pretty critical to our infrastructure. Would a firmware mismatch cause this behavior?

I have seen documentation indicating that MLAG doesn’t work between PPC and X86 switches. I sure hope that’s not the case for LAG…

Hi Rick,

LAG should work fine between the SX1036 and the SN2700 switch. The CPU limitation applies only to MLAG, where the CPU architecture must match on both switches.

Can you please verify your configs?

Is this a regular LACP port channel between the two switches?

What is the status of the second port which you are bundling in a LACP? Is it up/down/suspended?

Please share the details.

Thanks

Khwaja

We’re not using LACP. We actually got it working by changing the port mode to “hybrid” instead of “trunk”. All of our other LAG links work fine in trunk mode. We figure there’s a misconfiguration somewhere in our system causing this, but we have a bunch of switches running at this point. The hybrid workaround has bumped this pretty low on the priority queue, particularly as any debugging would likely bring down a link that is critical to production work. But I’m all ears if somebody has an idea why we have this issue. Thanks.