MSX6012 MLAG

Hello,

I have two SX6012 switches and want to configure MLAG.

But when I followed the attached guide, the MLAG-VIP did not come up: only one node became master, and the other is stuck in the unknown state.

The switches are running version 3.6.8012.

I attached the two "show running-config" outputs and all the information I have in two .txt files.

Please let me know whether there is a known issue with this version, whether I misconfigured something, whether there is an MLAG issue in 3.6.8012, or whether the L3 Ethernet license is not sufficient for MLAG.
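For context, this is the MLAG skeleton I followed from the attached user manual. The VLAN, port-channel, and IPL peer addresses below are illustrative placeholders (not necessarily my exact values), so please double-check them against the manual:

(config)# protocol mlag
(config)# interface port-channel 1
(config interface port-channel 1)# ipl 1
(config interface port-channel 1)# exit
(config)# interface vlan 4000
(config interface vlan 4000)# ip address 10.10.10.1 /30
(config interface vlan 4000)# ipl 1 peer-address 10.10.10.2
(config interface vlan 4000)# exit
(config)# mlag-vip my-vip ip 10.0.24.12 /22 force
(config)# no mlag shutdown

The peer switch gets the mirrored IPL address (10.10.10.2 locally, peer 10.10.10.1) and the same mlag-vip line.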

Thank you very much in advance.

BR,

Dominik

MLNX-OS_VPI_User_Manual_3.6.8012.pdf (7.7 MB)

MLNX2_sh_run_01.txt (11.1 KB)

MLNX1_sh_run_01.txt (21.2 KB)

Dear All,

I just created two full sysdumps from the two SX6012 switches and attached them to this update.

I also noticed this in the logs: "No mlag-vip - mgmt KA is not supported"

Does anyone know what this means?

I tried to search online and in the available documents, but I was unable to find any mention of it.

Once again Thank you in advance.

BR,
Dominik

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: Cluster daemon launched

Mar 1 16:34:38 MLNX2 pm[4337]: [pm.NOTICE]: Launched clusterd (Clustering Daemon) with pid 8865

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_golden_profile: golden-profile = 4

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: Initial HA profile=mlag(mellanox_none)

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: 5 mDNS node configuration tuples:

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: ha-profile = mlag

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: os-release = 3.6.8012

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: system-type = SX6012

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: golden-profile = 4

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: mlnx_init_internal: cpu-type = ppc

Mar 1 16:34:38 MLNX2 mlagd[5068]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: set cluster master preference to 50

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: cluster mDNS enable"true"

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: Forking then execing binary /usr/bin/mDNSResponder with argv "/usr/bin/mDNSResponder -d -a 10.0.24.11".

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: Process with pid 8867: launched (not waiting for it)

Mar 1 16:34:38 MLNX2 mDNSResponder[8867]: mDNSResponder starting up

Mar 1 16:34:38 MLNX2 clusterd[8865]: [clusterd.NOTICE]: dnode publish: starting

Mar 1 16:34:39 MLNX2 mDNSResponder[8867]: ResolveSimultaneousProbe: done

Mar 1 16:34:39 MLNX2 last message repeated 5 times

Mar 1 16:34:39 MLNX2 mlagd[5068]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Mar 1 16:34:39 MLNX2 clusterd[8865]: [clusterd.NOTICE]: Cluster dnode publish: success

Mar 1 16:34:39 MLNX2 clusterd[8865]: [clusterd.NOTICE]: cl_disco_dnode_check_rename: hostid=123212341234 old_name=MLNX2 new_name=MLNX1(MLNX1-123212341234)

Mar 1 16:34:39 MLNX2 clusterd[8865]: [clusterd.ERR]: cl_disco_master_resolve_blocking: Unable to join cluster my-vip master cpu=/ppc ha=/mlag cnt=0

Mar 1 16:34:39 MLNX2 clusterd[8865]: [clusterd.NOTICE]: cl_disco_master_resolve_blocking: Restarting clusterd because unable to join cluster my-vip master cpu=/ppc ha=/mlag

Mar 1 16:34:39 MLNX2 clusterd[8865]: [clusterd.NOTICE]: Shutdown mDNSResponder with pid 8867

Mar 1 16:34:39 MLNX2 mgmtd[4397]: [mgmtd.NOTICE]: Async: timed out getting external response for type query_request session 432 id 40794 from clusterd-8865

Mar 1 16:34:39 MLNX2 mgmtd[4397]: [mgmtd.NOTICE]: Starting to dump backlogged messages: 0 in queue

Mar 1 16:34:39 MLNX2 mgmtd[4397]: [mgmtd.ERR]: md_system_get_layout_disk_names(), md_system.c:3027, build 1: Unexpected empty disk layout!

Mar 1 16:34:39 MLNX2 mgmtd[4397]: [mgmtd.ERR]: md_system_iterate_disk(), md_system.c:3108, build 1: Error code 14001 (unexpected NULL) returned

Mar 1 16:34:40 MLNX2 mlagd[5068]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Mar 1 16:34:41 MLNX2 mlagd[5068]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Mar 1 16:34:41 MLNX2 statsd[5006]: [statsd.NOTICE]: alarm 'cpu_util_indiv': triggered for rising clear for event cpu_util_indiv

Mar 1 16:34:41 MLNX2 mgmtd[4397]: [mgmtd.NOTICE]: Event occurred (CPU utilization has fallen back to normal levels), but no mailhub configured.

Mar 1 16:34:42 MLNX2 mlagd[5068]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Hi Dominik.

Do you have ping and SSH between the 2 mgmt0 ports of the switches?

Please remove the mlag-vip config from switch #2:

(config)# no mlag-vip

(config)# mlag-vip my-vip
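Afterwards, check the state on both switches with the commands below (the exact output format may differ slightly between releases); the VIP state should list both nodes, one master and one standby:

(config)# show mlag-vip
(config)# show mlag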

Dear Eddie Shklaer,

Thank you very much for the reply.

Yes, ping and SSH work between the two nodes, from each mgmt0 to the other mgmt0 IP address.

The sh run is indeed odd, because I typed this in the CLI on BOTH switches:

MLNX1: mlag-vip my-vip ip 10.0.24.12 /22 force

MLNX2: mlag-vip my-vip ip 10.0.24.12 /22 force

But as you rightly pointed out, I saw that MLNX2 has this in the config:

mlag-vip my-vip ip 0.0.0.0 /0 force

Even if I apply this MLAG config to MLNX2, I get the same result: mlag-vip my-vip

sh run shows: mlag-vip my-vip ip 0.0.0.0 /0 force

And sadly the MLAG-VIP is still out of sync: Master/Unknown.

Thank you very much for your help.

Please let me know whether this is a known bug or whether I misconfigured something.

Best Regards,

Dominik

other sysdump

I just saved the log files, and they show:

cl_disco_master_resolve_blocking: Unable to join cluster my-vip master cpu=/ppc ha=/mlag cnt=0

After I changed the host ID, the two MSX6012 switches started to see each other (still not in HA/MLAG-VIP), but a new issue emerged: MLNX1_2_mlag_disk_error

MLNX1:

Mar 1 18:24:20 MLNX1 clusterd[5239]: [clusterd.NOTICE]: session 41: accepted connection from IP 10.0.24.11

Mar 1 18:24:20 MLNX1 clusterd[5239]: [clusterd.NOTICE]: CCL: Accepted connection from 10.0.24.11, port 38510

Mar 1 18:24:20 MLNX1 clusterd[5239]: [clusterd.ERR]: HMAC verification failed

Mar 1 18:24:20 MLNX1 clusterd[5239]: [clusterd.ERR]: cl_verify_hmac_md5(), cl_comm.c:1508, build 1: Error code 14613 returned

Mar 1 18:24:20 MLNX1 clusterd[5239]: [clusterd.ERR]: cl_ccl_msg_recv(), cl_comm.c:1559, build 1: HMAC verification failed

MLNX2:

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: Cluster dnode publish: success

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: Cluster master resolve: success: 10.0.24.10 60102 (MLNX1-123212341234)

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: session 1: enabled listening on IP 10.0.24.11

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: session 2: connected to IP 10.0.24.10

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: master change event sent: false

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: standby change event sent: false

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: master vip: cleared any old vip

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: Master has been detected , update mDNSResponder enable state [Enabled]

Mar 1 17:55:15 MLNX2 clusterd[5532]: [clusterd.NOTICE]: cl_config_mdns_enable: role=unknown

Mar 1 17:55:15 MLNX2 mDNSResponder[5533]: mDNSResponder allow traffic send value changed from : Allowed to : Allowed

Mar 1 17:55:16 MLNX2 mlagd[5066]: TID 1251996816: [mlagd.NOTICE]: [MLAG_HEALTH.NOTICE] No mlag-vip - mgmt KA is not supported

Mar 1 17:55:16 MLNX2 mgmtd[4389]: [mgmtd.ERR]: md_system_get_layout_disk_names(), md_system.c:3027, build 1: Unexpected empty disk layout!

Mar 1 17:55:16 MLNX2 mgmtd[4389]: [mgmtd.ERR]: md_system_iterate_disk(), md_system.c:3108, build 1: Error code 14001 (unexpected NULL) returned

The disk layout issue is solved.

There is still an issue with: HMAC verification failed

Has anybody had issues with HMAC before?

Mar 1 18:56:54 MLNX2 clusterd[5560]: [clusterd.NOTICE]: session 2: accepted connection from IP 10.0.24.10

Mar 1 18:56:54 MLNX2 clusterd[5560]: [clusterd.NOTICE]: CCL: Accepted connection from 10.0.24.10, port 37635

Mar 1 18:56:54 MLNX2 clusterd[5560]: [clusterd.ERR]: HMAC verification failed

Mar 1 18:56:54 MLNX2 clusterd[5560]: [clusterd.ERR]: cl_verify_hmac_md5(), cl_comm.c:1508, build 1: Error code 14613 returned

Mar 1 18:56:54 MLNX2 clusterd[5560]: [clusterd.ERR]: cl_ccl_msg_recv(), cl_comm.c:1559, build 1: HMAC verification failed

Hi, please open a support case - it will be easier to troubleshoot