ASUS GX10 ConnectX-7 will not recognize QSFP112 cable plug-in

I’m still having issues with ASUS GX10s not recognizing when I plug in QSFP cables to bring the ConnectX-7 ports online. It’s as if the system never sees the cable being plugged in. I’ve tried officially supported cables from both Amphenol and NADDOD. This is happening on 3 separate GX10s, all with the latest BIOS, OS kernel, and drivers. Is anyone else seeing this issue on GX10s?

ASUS support has been less than stellar: they asked me to run Windows diagnostic tools, repeatedly asked me to verify the BIOS (facepalm), and then repeatedly told me “we are escalating to our BIOS/Driver team, who will contact you in 1-2 business days”, after which I never got anything back. All 3 units run clean with no errors on the fieldiag tests, and all 3 come from different production lots, so I highly doubt a hardware failure. NADDOD RMA’d their cable, tested it clean, tested a new one on their DGX, and shipped me a known-good cable.

Any advice or suggestions greatly appreciated

Hi @robert287,

Thanks for all the detail here – I know this is a painful one to debug.

First step that will really help on our side is a full nvidia-bug-report bundle, since that pulls in dmesg, PCI info, driver versions, etc.:

sudo nvidia-bug-report.sh

This will produce a .gz file in your current directory; you can attach that archive to the thread.

A couple of additional details that are useful alongside the bug report:

  1. Environment

    • DGX Spark OS / Ubuntu version and kernel version you’re running.

    • Confirmation that these are stock DGX Spark GX10 systems (no custom OFED/DOCA or kernel modules added).

  2. Link / module behavior on the CX7 port

    • Output of:

      ip a
      
      

      and then:

      sudo ethtool enP2p1s0f1np1
      
      

      (replace enP2p1s0f1np1 with whatever the CX7 interface name is on your system) before and after you plug in the QSFP cable, so we can see whether the OS ever reports module presence or link changes.

  3. Cables

    • Exact part numbers for the Amphenol and NADDOD cables you’ve tried.

    • For DGX Spark stacking we currently validate against the cables listed in the DGX Spark User Guide:

      • Amphenol NJAAKK‑N911 (QSFP to QSFP112, 32AWG, 400 mm, LSZH) and NJAAKK‑0006 (0.5 m version)

      • Luxshare LMTQF022‑SD‑R (QSFP112 400G DAC cable, 400 mm, 30 AWG)

      Ref: “Spark Stacking” in the DGX Spark User Guide. If your Amphenol cable is one of these PNs (or you have access to one of the listed parts), that data point is very helpful when we look at the logs.

With the nvidia-bug-report plus the interface / cable details, we can see whether the issue is “module never detected” vs. “module detected but link never comes up” and route it appropriately.
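The before/after comparison requested above could be captured along these lines. This is only a minimal sketch: link_summary and snapshot are hypothetical helper names, the output file names are arbitrary, and the interface name in the usage comments should be replaced with your own.

```shell
# Keep only the ethtool fields that change when a module is detected or a
# link comes up. Reads `ethtool <iface>` output on stdin.
link_summary() {
    grep -E '^[[:space:]]*(Speed|Duplex|Port|Link detected):' \
        | sed -E 's/^[[:space:]]+//'
}

# Hypothetical helper: save a labelled snapshot for one interface.
snapshot() {
    sudo ethtool "$1" | link_summary > "ethtool_${1}_${2}.txt"
}

# Usage (interface name is an example):
#   snapshot enP2p1s0f1np1 before
#   ... plug in the QSFP cable and wait a few seconds ...
#   snapshot enP2p1s0f1np1 after
#   diff ethtool_enP2p1s0f1np1_before.txt ethtool_enP2p1s0f1np1_after.txt
```

An empty diff would point toward “module never detected”; a changed Port or Speed line with Link detected still “no” would point toward “module detected but link never comes up”.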

Yes, these are stock units with no mods.

One note: to even get the interfaces to show up, I had to disable the hotplug by moving it to .bak and touching an empty file in its place.

Linux node-01 6.17.0-1014-nvidia #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 17 19:01:40 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux

Distributor ID: Ubuntu

Description: Ubuntu 24.04.4 LTS

Release: 24.04

Codename: noble

ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000

link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

inet 127.0.0.1/8 scope host lo

   valid_lft forever preferred_lft forever

inet6 ::1/128 scope host noprefixroute 

   valid_lft forever preferred_lft forever

2: enP7s7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000

link/ether 30:c5:99:3f:0f:17 brd ff:ff:ff:ff:ff:ff

altname enP7p1s0

inet 10.0.0.1/24 brd 10.0.0.255 scope global noprefixroute enP7s7

   valid_lft forever preferred_lft forever

inet6 fe80::4e6b:11eb:95d2:e621/64 scope link noprefixroute 

   valid_lft forever preferred_lft forever

3: enp1s0f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000

link/ether 30:c5:99:3f:0f:18 brd ff:ff:ff:ff:ff:ff

4: enp1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000

link/ether 30:c5:99:3f:0f:19 brd ff:ff:ff:ff:ff:ff

5: enP2p1s0f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000

link/ether 30:c5:99:3f:0f:1c brd ff:ff:ff:ff:ff:ff

6: enP2p1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000

link/ether 30:c5:99:3f:0f:1d brd ff:ff:ff:ff:ff:ff

7: wlP9s9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000

link/ether 50:bb:b5:a4:5e:06 brd ff:ff:ff:ff:ff:ff

altname wlP9p1s0

inet 192.168.68.56/22 brd 192.168.71.255 scope global dynamic noprefixroute wlP9s9

   valid_lft 4617sec preferred_lft 4617sec

inet6 fdca:d7e:9794:455d:287d:1dbc:9064:8810/64 scope global temporary dynamic 

   valid_lft 1667sec preferred_lft 1667sec

inet6 fdca:d7e:9794:455d:552:d94a:ded0:d890/64 scope global dynamic mngtmpaddr noprefixroute 

   valid_lft 1667sec preferred_lft 1667sec

inet6 fe80::6800:1ed4:ff30:5aa9/64 scope link noprefixroute 

   valid_lft forever preferred_lft forever

8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default

link/ether 2a:e3:ef:7f:00:55 brd ff:ff:ff:ff:ff:ff

inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0

   valid_lft forever preferred_lft forever

inet6 fe80::28e3:efff:fe7f:55/64 scope link 

   valid_lft forever preferred_lft forever

#######################################

sudo ethtool enP2p1s0f1np1

Settings for enP2p1s0f1np1:

Supported ports: [ ]

Supported link modes: 1000baseT/Full

                    10000baseT/Full

                    1000baseKX/Full

                    10000baseKR/Full

                    10000baseR_FEC

                    40000baseKR4/Full

                    40000baseCR4/Full

                    40000baseSR4/Full

                    40000baseLR4/Full

                    25000baseCR/Full

                    25000baseKR/Full

                    25000baseSR/Full

                    50000baseCR2/Full

                    50000baseKR2/Full

                    100000baseKR4/Full

                    100000baseSR4/Full

                    100000baseCR4/Full

                    100000baseLR4_ER4/Full

                    50000baseSR2/Full

                    1000baseX/Full

                    10000baseCR/Full

                    10000baseSR/Full

                    10000baseLR/Full

                    10000baseER/Full

                    50000baseKR/Full

                    50000baseSR/Full

                    50000baseCR/Full

                    50000baseLR_ER_FR/Full

                    50000baseDR/Full

                    100000baseKR2/Full

                    100000baseSR2/Full

                    100000baseCR2/Full

                    100000baseLR2_ER2_FR2/Full

                    100000baseDR2/Full

                    200000baseKR4/Full

                    200000baseSR4/Full

                    200000baseLR4_ER4_FR4/Full

                    200000baseDR4/Full

                    200000baseCR4/Full

                    100000baseKR/Full

                    100000baseSR/Full

                    100000baseLR_ER_FR/Full

                    100000baseCR/Full

                    100000baseDR/Full

                    200000baseKR2/Full

                    200000baseSR2/Full

                    200000baseLR2_ER2_FR2/Full

                    200000baseDR2/Full

                    200000baseCR2/Full

Supported pause frame use: Symmetric

Supports auto-negotiation: Yes

Supported FEC modes: None RS BASER

Advertised link modes: 1000baseT/Full

                    10000baseT/Full

                    1000baseKX/Full

                    10000baseKR/Full

                    10000baseR_FEC

                    40000baseKR4/Full

                    40000baseCR4/Full

                    40000baseSR4/Full

                    40000baseLR4/Full

                    25000baseCR/Full

                    25000baseKR/Full

                    25000baseSR/Full

                    50000baseCR2/Full

                    50000baseKR2/Full

                    100000baseKR4/Full

                    100000baseSR4/Full

                    100000baseCR4/Full

                    100000baseLR4_ER4/Full

                    50000baseSR2/Full

                    1000baseX/Full

                    10000baseCR/Full

                    10000baseSR/Full

                    10000baseLR/Full

                    10000baseER/Full

                    50000baseKR/Full

                    50000baseSR/Full

                    50000baseCR/Full

                    50000baseLR_ER_FR/Full

                    50000baseDR/Full

                    100000baseKR2/Full

                    100000baseSR2/Full

                    100000baseCR2/Full

                    100000baseLR2_ER2_FR2/Full

                    100000baseDR2/Full

                    200000baseKR4/Full

                    200000baseSR4/Full

                    200000baseLR4_ER4_FR4/Full

                    200000baseDR4/Full

                    200000baseCR4/Full

                    100000baseKR/Full

                    100000baseSR/Full

                    100000baseLR_ER_FR/Full

                    100000baseCR/Full

                    100000baseDR/Full

                    200000baseKR2/Full

                    200000baseSR2/Full

                    200000baseLR2_ER2_FR2/Full

                    200000baseDR2/Full

                    200000baseCR2/Full

Advertised pause frame use: Symmetric

Advertised auto-negotiation: Yes

Advertised FEC modes: Not reported

Speed: Unknown!

Duplex: Unknown! (255)

Auto-negotiation: on

Port: Other

PHYAD: 0

Transceiver: internal

Supports Wake-on: d

Wake-on: d

Link detected: no (No cable)

###########################################################
Amphenol – NJAAKK-N911

NDD Q112-400G-CU0-5

##############################################
The system never recognizes the cable being plugged in. I’ve tried disconnecting power and holding the power button to discharge any residual power. NADDOD tested the original cable they provided, then tested a new cable on their pair of DGX Sparks, and it worked fine.
###################################################
I also have the output from a fieldiag run in single-user mode. Everything is flagged as passed, but it doesn’t look like it does any network device testing.

##################################################

I compared /proc/iomem against the addresses that appear to be referenced when the system probes the PCI bus, and it looks like there is a memory address space mismatch.

  • 0x05170000-0x051cffff (NVDA8800)

  • 0xc8000000-0xd7ffffff (NVDA8900)

Looking closely at the iomem dump, neither of those addresses exists in the kernel’s memory map.

  • There is a massive gap between 0b39ffff and 1002d000.

  • The 0x05170000 address is completely missing from the physical memory space.

  • The 0xc8000000 address is also nowhere to be found (the dump ends around 647fffff).

It looks like the ASUS BIOS ACPI tables are instructing the ConnectX-7 security enclave to map its management mailboxes (DOE) to physical memory addresses that do not exist on the motherboard.

If that is right, when the CX7 tries to reach that non-existent memory, it returns the -5 (EIO) error, enters a hard panic, and permanently cuts power to the QSFP cages. This is why no pci=realloc or IOMMU bypass will work: you can’t reallocate memory that the physical silicon lacks.
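Because grep -A/-B only prints lines near a match, a containment check against the complete /proc/iomem is one way to test whether an address falls inside any mapped range. The sketch below assumes the usual “start-end : name” line format; iomem_contains is a hypothetical helper name. Read the file as root, since unprivileged reads typically show zeroed addresses.

```shell
# Reads /proc/iomem-style lines ("start-end : name") on stdin and prints every
# range that contains the given hex address. Returns 1 if no range contains it.
iomem_contains() {
    addr=$((0x${1#0x}))
    found=1
    # Default IFS strips the indentation used for nested ranges.
    while read -r line; do
        range=${line%% : *}          # text before " : "
        start=${range%%-*}
        end=${range#*-}
        # Skip lines that are not "hex-hex".
        [ "$start" != "$range" ] || continue
        [ -n "$start" ] && [ -n "$end" ] || continue
        case "$start$end" in
            *[!0-9a-fA-F]*) continue ;;
        esac
        if [ "$addr" -ge $((0x$start)) ] && [ "$addr" -le $((0x$end)) ]; then
            printf '%s\n' "$line"
            found=0
        fi
    done
    return "$found"
}

# Usage:
#   sudo cat /proc/iomem | iomem_contains 0x05170000 || echo "not mapped"
#   sudo cat /proc/iomem | iomem_contains 0xc8000000 || echo "not mapped"
```

A miss here only shows that the kernel has no resource registered at that address; it does not by itself show where the firmware is trying to map anything.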

*********START iomem dump *******************************

rob@spark-node1:~$ sudo cat /proc/iomem | grep -A 5 -B 5 -iE "05170000|c8000000|NVDA"

0b316000-0b316003 : MTKW9002:00
0b316004-0b316007 : MTKW9002:00
0b316010-0b316013 : MTKW9002:00
0b316018-0b31601b : MTKW9002:00
0b39f600-0b39ffff : MTKW9002:00
1002d000-1002dfff : NVDA9221:00
1002d000-1002dfff : NVDA9221:00 NVDA9221:00
10200000-10201fff : DRAM8901:00
10206000-10207fff : DRAM8901:00
10208000-10209fff : DRAM8901:00
12410000-12410fff : NVDA9221:00
12440000-12440fff : NVDA9221:00
12440000-12440fff : NVDA9221:00 NVDA9221:00
12460000-12460fff : NVDA9221:00
12460000-12460fff : NVDA9221:00 NVDA9221:00
12800000-12800fff : NVDA9221:00
12830000-12830fff : NVDA9221:00
12830000-12830fff : NVDA9221:00 NVDA9221:00
12850000-12850fff : NVDA9221:00
12850000-12850fff : NVDA9221:00 NVDA9221:00
12870000-12870fff : NVDA9221:00
12870000-12870fff : NVDA9221:00 NVDA9221:00
12890000-12890fff : NVDA9221:00
12890000-12890fff : NVDA9221:00 NVDA9221:00
128b0000-128b0fff : NVDA9221:00
128b0000-128b0fff : NVDA9221:00 NVDA9221:00
12a50000-12a50fff : NVDA9221:00
12a50000-12a50fff : NVDA9221:00 NVDA9221:00
12e00000-12e00fff : NVDA9221:00
12e30000-12e30fff : NVDA9221:00
12e30000-12e30fff : NVDA9221:00 NVDA9221:00
13000000-1301ffff : arm-smmu-v3.1.auto
13000000-13000dff : arm-smmu-v3.1.auto
13002000-13002fff : arm-smmu-v3-pmcg.10.auto
13002000-13002fff : arm-smmu-v3-pmcg.10.auto arm-smmu-v3-pmcg.10.auto
13010000-13010dff : arm-smmu-v3.1.auto

130d2000-130d2fff : arm-smmu-v3-pmcg.15.auto arm-smmu-v3-pmcg.15.auto

130e2000-130e2fff : arm-smmu-v3-pmcg.16.auto
130e2000-130e2fff : arm-smmu-v3-pmcg.16.auto arm-smmu-v3-pmcg.16.auto
130f2000-130f2fff : arm-smmu-v3-pmcg.16.auto
130f2000-130f2fff : arm-smmu-v3-pmcg.16.auto arm-smmu-v3-pmcg.16.auto
13630000-13630fff : NVDA9221:00
13630000-13630fff : NVDA9221:00 NVDA9221:00
13800000-1381ffff : arm-smmu-v3.0.auto
13800000-13800dff : arm-smmu-v3.0.auto
13802000-13802fff : arm-smmu-v3-pmcg.3.auto
13802000-13802fff : arm-smmu-v3-pmcg.3.auto arm-smmu-v3-pmcg.3.auto
13810000-13810dff : arm-smmu-v3.0.auto

138d2000-138d2fff : arm-smmu-v3-pmcg.8.auto arm-smmu-v3-pmcg.8.auto

138e2000-138e2fff : arm-smmu-v3-pmcg.9.auto
138e2000-138e2fff : arm-smmu-v3-pmcg.9.auto arm-smmu-v3-pmcg.9.auto
138f2000-138f2fff : arm-smmu-v3-pmcg.9.auto
138f2000-138f2fff : arm-smmu-v3-pmcg.9.auto arm-smmu-v3-pmcg.9.auto
14200000-14200fff : NVDA2861:00
14900000-1491ffff : arm-smmu-v3.2.auto
14900000-14900dff : arm-smmu-v3.2.auto
14902000-14902fff : arm-smmu-v3-pmcg.17.auto
14902000-14902fff : arm-smmu-v3-pmcg.17.auto arm-smmu-v3-pmcg.17.auto
14910000-14910dff : arm-smmu-v3.2.auto

14952000-14952fff : arm-smmu-v3-pmcg.18.auto arm-smmu-v3-pmcg.18.auto

14962000-14962fff : arm-smmu-v3-pmcg.19.auto
14962000-14962fff : arm-smmu-v3-pmcg.19.auto arm-smmu-v3-pmcg.19.auto
14972000-14972fff : arm-smmu-v3-pmcg.19.auto
14972000-14972fff : arm-smmu-v3-pmcg.19.auto arm-smmu-v3-pmcg.19.auto
16050000-16050fff : NVDA0310:00
16a00000-16a00fff : MTKI0511:00
16a00000-16a0001f : serial
16b20000-16b2ffff : MIPI0100:02
16b30000-16b3ffff : MIPI0100:00
16bd0000-16bd0fff : pnp 00:00
16c10000-16c10fff : NVDA0210:00
16c50000-16c50fff : NVDA0210:01
16c70000-16c70fff : NVDA0210:02
16d10000-16d1ffff : MIPI0100:01
18010000-18010fff : NVDA8600:00
18020000-18020fff : NVDA8601:00
18020000-1802000f : NVDA8601:00
1a000000-1a000fff : NVDA8301:00
1a001000-1a001003 : NVDA8302:00
1a00f000-1a00ffff : NVDA8301:00
1a010000-1a010003 : NVDA8302:00
1a020000-1a07ffff : NVDA8301:00
1a080000-1a09ffff : NVDA8301:00
1a0a0000-1a0fffff : NVDA8302:00
1a100000-1a11ffff : NVDA8302:00
1a120000-1a15ffff : NVDA8301:00
1a160000-1a19ffff : NVDA8302:00
1a350000-1a350fff : NVDA8301:00
1a360000-1a360fff : NVDA8302:00
1a400000-1a400fff : NVDA8303:00
1a40f000-1a40ffff : NVDA8303:00
1a420000-1a47ffff : NVDA8303:00
1a480000-1a4bffff : NVDA8303:00
1a560000-1a5dffff : NVDA8303:00
1a750000-1a750fff : NVDA8303:00
1a800000-1a87ffff : NVDA8200:00
1aaa0000-1aaa0fff : NVDA8200:00
1aab0000-1aab0fff : NVDA8200:00
1ab20000-1ab20fff : NVDA8200:00
1c004000-1c005fff : DRAM8901:00
1c041000-1c041fff : sbsa-gwdt.0
1c041000-1c041fff : sbsa-gwdt.0 sbsa-gwdt.0
1c042000-1c042fff : sbsa-gwdt.0
1c042000-1c042fff : sbsa-gwdt.0 sbsa-gwdt.0
1c440000-1c440fff : arch_mem_timer
1c548000-1c548133 : SPMI0001:00
1c548200-1c548333 : SPMI0002:00
1c548400-1c548533 : SPMI0003:00
1c54a000-1c54afff : NVDA9221:00
1c570000-1c5700ff : SPMI0001:00
1c5c0000-1c5c00ff : SPMI0002:00
1c610000-1c6100ff : SPMI0003:00
1c660000-1c6600ff : SPMI0004:00
1c6a0000-1c6a08fe : SPMI0001:00
1c6b0000-1c6b08fe : SPMI0002:00
1c6c0000-1c6c08fe : SPMI0003:00
1c6d0000-1c6d08fe : SPMI0004:00
1c8b0000-1c8b00ff : NVDA6210:00
1c8c0000-1c8c00ff : NVDA6210:00
1c900000-1c900fff : NVDA6210:00
1d600000-1d600fff : pnp 00:00
1d640000-1d640fff : pnp 00:00
1d690000-1d690fff : pnp 00:00
1d790000-1d790fff : pnp 00:00
1d860000-1d8677ff : NVDA8001:00
1d860000-1d8677ff : NVDA8001:00 NVDA8001:00
1d868000-1d8680ff : NVDA8001:00
1d870000-1d8777ff : NVDA8000:04
1d870000-1d8777ff : NVDA8000:04 NVDA8000:04
1d878000-1d8780ff : NVDA8000:04
1d880000-1d883fff : NVDA8001:00
1d890000-1d894fff : NVDA8000:04
1d8e0560-1d8e0577 : NVDA8001:00
1d8e0578-1d8e0593 : NVDA8000:04
1db60000-1db677ff : NVDA8000:00
1db60000-1db677ff : NVDA8000:00 NVDA8000:00
1db68000-1db680ff : NVDA8000:00
1db70000-1db73fff : NVDA8000:00
1db90000-1db977ff : NVDA8000:01
1db90000-1db977ff : NVDA8000:01 NVDA8000:01
1db98000-1db980ff : NVDA8000:01
1dba0000-1dba3fff : NVDA8000:01
1dbd012c-1dbd0143 : NVDA8000:00
1dbd0144-1dbd015b : NVDA8000:01
1dde0000-1dde77ff : NVDA8000:02
1dde0000-1dde77ff : NVDA8000:02 NVDA8000:02
1dde8000-1dde80ff : NVDA8000:02
1ddf0000-1ddf3fff : NVDA8000:02
1de10000-1de177ff : NVDA8000:03
1de10000-1de177ff : NVDA8000:03 NVDA8000:03
1de18000-1de180ff : NVDA8000:03
1de20000-1de23fff : NVDA8000:03
1de5012c-1de50143 : NVDA8000:02
1de50144-1de5015b : NVDA8000:03
24000000-281fffff : PCI Bus 000f:00
24000000-27ffffff : PCI Bus 000f:01
24000000-27ffffff : 000f:01:00.0
24000000-27ffffff : nvidia
29000000-291fffff : PCI ECAM

311c0100-311c0103 : MTK00055:00
311c0104-311c0107 : MTK00055:00
311c0108-311c010b : MTK00055:00
311c010c-311c010f : MTK00055:00
31270074-31270077 : MTK00055:00
36078000-3607ffff : NVDA2014:00
36078000-3607ffff : NVDA2014:00 NVDA2014:00
3e900000-3e90ffff : NVDA3000:00
5d010000-5f7fffff : PCI Bus 0002:00
5d100000-5d2fffff : PCI Bus 0002:01
5d100000-5d1fffff : 0002:01:00.0
5d200000-5d2fffff : 0002:01:00.1
62010000-647fffff : PCI Bus 0004:00

nvidia-bug-report.log.gz (2.1 MB)

mlxlink details

Physical state                : ETH_AN_FSM_ENABLE       ← stuck in auto-neg FSM

Supported Cable Speed (Ext.) : 0x00000000 () ← NIC sees ZERO cable speeds

Status Opcode : 1024

Group Opcode : MNG FW ← firmware-side assertion

Recommendation : Cable is unplugged ← FW's conclusion, wrong

Identical on both CX-7 ASICs. This gives the NVIDIA dev a specific firmware opcode to trace: 1024 in the MNG FW group. Much more actionable than just “status: 0x3”.

ethtool -m — same root cause:

netlink error: mlx5_core: Query module eeprom by page failed, read 0 bytes, err -5

All 4 ports. err -5 = EIO. The firmware’s I²C/MCIA path to the QSFP EEPROM is just dead. That’s what mlx5_query_mcia 0x3 is reporting at a higher level.
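To confirm the EEPROM read fails the same way on every port, one option is to loop over all interfaces bound to mlx5_core. This is a sketch that assumes the usual sysfs layout, where /sys/class/net/<iface>/device/driver is a symlink to the bound driver; ifaces_for_driver is a hypothetical helper, and its second argument exists only so it can be pointed at a test tree.

```shell
# Print the names of network interfaces bound to a given kernel driver.
# $1: driver name (default mlx5_core); $2: sysfs root (default /sys).
ifaces_for_driver() {
    driver=${1:-mlx5_core}
    root=${2:-/sys}
    for dev in "$root"/class/net/*; do
        link="$dev/device/driver"
        # Virtual interfaces (lo, docker0, ...) have no driver symlink.
        [ -L "$link" ] || continue
        if [ "$(basename "$(readlink "$link")")" = "$driver" ]; then
            basename "$dev"
        fi
    done
    return 0
}

# Usage: show the first lines of the module EEPROM query for each port.
#   for i in $(ifaces_for_driver); do
#       echo "== $i"; sudo ethtool -m "$i" 2>&1 | head -n 3
#   done
```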

  • mst_pci kernel module fails to build for 6.17.0-1014-nvidia — MFT 4.34.1-12 doesn’t install it. Diagnostics still work via PCI address so not blocking, but worth NVIDIA knowing.

Firmware triangle captured cleanly:

  • FW: 28.45.4028

  • MFT: 4.34.1-12

  • amBER: 5.75

Thanks for all the extra detail you pulled together, especially the nvidia-bug-report, mlxlink, ethtool/ethtool -m, and /proc/iomem data.

In short, on multiple stock GX10s (no custom OFED/DOCA, hotplug already disabled) we see:

  • All ConnectX-7 functions enumerate and bind to mlx5_core, but CX7 ports (e.g. enP2p1s0f1np1) stay NO-CARRIER with Speed: Unknown / Link detected: no (No cable).

  • mlxlink reports ETH_AN_FSM_ENABLE, Supported Cable Speed (Ext.) = 0x00000000 (), and MNG FW status opcode 1024 on every port.

  • ethtool -m fails on all CX7 ports with mlx5_core: Query module eeprom by page failed, read 0 bytes, err -5, so every module EEPROM read returns EIO.

  • This reproduces with approved stacking cables (Amphenol NJAAKK-N911, NADDOD Q112-400G-CU0-5), and NADDOD has confirmed the same cable works on their own DGX Spark pair.

There’s very clearly an issue here. I’ve escalated this issue to the internal DGX Spark and ConnectX-7 (NBU) teams so they can investigate the firmware/platform interaction and try to reproduce it on our side.

For others following along, the currently approved CX‑7 stacking cables for DGX Spark are listed here:
Spark Stacking — DGX Spark User Guide

We don’t have a workaround or fix to recommend yet, but this is being actively investigated internally. As soon as we have concrete guidance (for example, a firmware/BIOS update, confirmed configuration issue, or other next steps), I’ll post an update in this thread.

Hi @robert287,

The firmware team needs ConnectX‑7 firmware dumps from your system to move the debug forward. Here are the steps to collect them and share them back.

1. Check whether MFT / mst is available

mst --version

  • If you get a version string, continue to step 2.

  • If you see “command not found”, skip down to “If mst is not installed”.

2. Start mst and identify the CX7 devices

sudo mst start
sudo mst status -v

Look for lines like:

/dev/mst/mt4129_pciconf0  # (names may vary slightly)

These /dev/mst/mt41xx_pciconf* entries are the ConnectX‑7 devices.

3. Collect firmware dumps for each CX7 device

For each CX7 entry you found (adjust the device name and output filename each time):

# Collect on‑NIC amber (firmware) logs
sudo mlxlink -d /dev/mst/mt4129_pciconf0 --amber_collect

# Dump firmware state to a file
sudo mstdump /dev/mst/mt4129_pciconf0 > cx7_fw_dump_0.txt

Repeat for mt4129_pciconf1, etc., changing the output file (e.g. cx7_fw_dump_1.txt, cx7_fw_dump_2.txt, …).

4. Package the dumps

From the directory where the cx7_fw_dump_*.txt files are located:

tar czf gx10_cx7_fw_dumps_node1.tgz cx7_fw_dump_*.txt

If you have multiple nodes, you can repeat the same procedure and use different archive names, e.g. gx10_cx7_fw_dumps_node2.tgz.
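With several devices per node, steps 3 and 4 can be looped rather than typed by hand. The following is a sketch only: collect_dumps is a hypothetical helper, it covers the mstdump step but not the amber collection, and the directory and command arguments are there so the loop can be tried against stand-ins before running it for real from a root shell.

```shell
# Run a dump command against every ConnectX-7 mst device and write one
# numbered output file per device. Prints the number of devices dumped.
# $1: device directory (default /dev/mst)
# $2: output directory (default current directory)
# $3: dump command (default mstdump)
collect_dumps() {
    devdir=${1:-/dev/mst}
    outdir=${2:-.}
    dumper=${3:-mstdump}
    n=0
    for dev in "$devdir"/mt41*_pciconf*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        "$dumper" "$dev" > "$outdir/cx7_fw_dump_${n}.txt"
        n=$((n + 1))
    done
    echo "$n"
}

# Usage (as root, after `mst start`):
#   collect_dumps
#   tar czf gx10_cx7_fw_dumps_node1.tgz cx7_fw_dump_*.txt
```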

5. Upload / share the archive

Preferred options (any one of these is fine):

  • Attach the .tgz file directly to this forum thread in a reply, so we can download it and pass it to the firmware team.

  • If you’d rather not post the dumps publicly, you can use the forum’s direct message feature to send the archive to me and I’ll share it internally.


If mst is not installed

If mst --version fails with “command not found”, we’ll first need Mellanox Firmware Tools (MFT):

  1. You can install MFT from the NVIDIA Networking “Mellanox Firmware Tools (MFT)” download page:
    https://network.nvidia.com/products/adapter-software/firmware-tools/
    (choose the Linux/Ubuntu package that matches your OS).

  2. After installing MFT, rerun the steps above starting from:

sudo mst start
sudo mst status -v

Once you’ve uploaded the archive, please reply here to let me know; I’ll confirm we received it and route it to the firmware team.

I’ll upload them today for all 3, and if the 4th one I am getting today has the same problem, I’ll upload those as well.

Thanks so much! Standing by!

Hmm, the forum is not letting me upload .tgz files here. Can I e-mail them to you?

I sent you a direct message. Note, .tar.gz and .tgz are generally equivalent.

gx10_cx7_fw_dumps_node3.tar.gz (3.7 KB)

gx10_cx7_fw_dumps_node1.tar.gz (2.8 MB)

gx10_cx7_fw_dumps_node2.tar.gz (2.9 MB)

awesome, thanks so much @robert287

Just FYI, if the newest firmware was supposed to fix the ConnectX-7 issues I am seeing, that is not the case: they are still being ejected right after boot :(

The cx7-pcie-hotplug MTKP0001:00: Cable removal event still fires ~16s after boot, and all 4 CX-7s still get torn out by the handler. Same trap, same outcome.

@Neill Just FYI, ASUS has formally punted: they said they can do nothing for this issue, that even RMAing the devices will not fix it, and that I need to rely on NVIDIA to solve the problem. FACEPALM. Quality support from the vendor; this is like Ford telling me to personally call Bosch to fix a sensor problem, but whatever. Curious if you’ve been able to make any progress on the issue.

One other interesting thing I am seeing in dmesg:

126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link). Shouldn’t that be showing me x16? Are there not 16 lanes?
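On the x4 question, the figure itself can be checked with a little arithmetic. This is a minimal sketch, assuming 128b/130b line encoding (which PCIe uses at 8, 16, and 32 GT/s) and integer Mb/s rounding per lane; the function name pcie_bw_mbps is made up for illustration.

```shell
# Usable PCIe bandwidth in Mb/s: per-lane transfer rate after 128b/130b
# encoding (integer division), multiplied by the lane count.
# $1: per-lane rate in MT/s (32000 for 32 GT/s); $2: number of lanes.
pcie_bw_mbps() {
    echo $(( $1 * 128 / 130 * $2 ))
}

# Usage:
#   pcie_bw_mbps 32000 4     # prints 126028, i.e. 126.028 Gb/s
#   pcie_bw_mbps 32000 16    # prints 504112, i.e. 504.112 Gb/s
```

So 126.028 Gb/s is exactly what a 32 GT/s x4 link works out to, and an x16 link at the same rate would come to about 504 Gb/s. This only confirms that the log line is self-consistent; it does not say how many lanes the GX10 routes to each ConnectX-7 function.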

@Neill I ran some more troubleshooting tonight

CX7 PCIe enumeration = works
mlx5 driver/firmware = works
manual MTK plug-in = restores interfaces
QSFP presence / EEPROM / I2C path = broken across all cages

After manually forcing MTK CX7 hotplug plug-in, all four ConnectX-7 mlx5 interfaces enumerate successfully. However, every CX7 port still reports NO-CARRIER / “No cable,” and ethtool -m fails identically on all four interfaces with:

mlx5_core: Query module eeprom by page failed, read 0 bytes, err -5

This reproduces across multiple NVIDIA-confirmed QSFP112/QSFP56 DACs. The remaining failure appears to be DGX Spark / ASUS GX10 platform-level QSFP module presence, EEPROM/I2C, or cage power/control path, not Linux network configuration or ConnectX-7 PCIe enumeration.

One last thing: when I force the ports back online with the hotplug script, you’ll notice the message about power state changes. It no longer mentions lower power below 27W; instead it reads as you see below: “PCIe slot power capability was not advertised.”

abling device (0000 -> 0002)

[ 1587.676665] mlx5_core 0000:01:00.0: firmware version: 28.45.4028

[ 1587.676684] mlx5_core 0000:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[ 1588.030503] mlx5_core 0000:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps

[ 1588.031107] mlx5_core 0000:01:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)

[ 1588.034878] mlx5_core 0000:01:00.0: Flow counters bulk query buffer size increased, bulk_query_len(8)

[ 1588.046916] mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged

[ 1588.048681] mlx5_core 0000:01:00.0: mlx5_pcie_event:322:(pid 22935): PCIe slot power capability was not advertised.

[ 1588.056362] mlx5_core 0000:01:00.0: mlx5e: IPSec ESP acceleration enabled

[ 1588.207894] mlx5_core 0000:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)

[ 1588.209781] mlx5_core 0000:01:00.0 enp1s0f0np0: renamed from eth0

[ 1588.233106] mlx5_core 0000:01:00.1: enabling device (0000 -> 0002)

[ 1588.233464] mlx5_core 0000:01:00.1: firmware version: 28.45.4028

[ 1588.233525] mlx5_core 0000:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[ 1588.688397] mlx5_core 0000:01:00.0 enp1s0f0np0: Link down

[ 1588.728414] mlx5_core 0000:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps

[ 1588.728470] mlx5_core 0000:01:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)

[ 1588.728651] mlx5_core 0000:01:00.1: Flow counters bulk query buffer size increased, bulk_query_len(8)

[ 1588.741303] mlx5_core 0000:01:00.1: mlx5e: IPSec ESP acceleration enabled

[ 1588.749968] mlx5_core 0000:01:00.1: Port module event: module 1, Cable unplugged

[ 1588.750411] mlx5_core 0000:01:00.1: mlx5_pcie_event:322:(pid 23002): PCIe slot power capability was not advertised.

[ 1588.888467] mlx5_core 0000:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)

[ 1588.889857] mlx5_core 0000:01:00.1 enp1s0f1np1: renamed from eth0

[ 1588.918492] mlx5_core 0002:01:00.0: enabling device (0000 -> 0002)

[ 1588.918643] mlx5_core 0002:01:00.0: firmware version: 28.45.4028

[ 1588.918672] mlx5_core 0002:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[ 1589.419841] mlx5_core 0002:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps

[ 1589.420189] mlx5_core 0002:01:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)

[ 1589.422935] mlx5_core 0002:01:00.0: Flow counters bulk query buffer size increased, bulk_query_len(8)

[ 1589.423895] mlx5_core 0000:01:00.1 enp1s0f1np1: Link down

[ 1589.430698] mlx5_core 0002:01:00.0: Port module event: module 0, Cable unplugged

[ 1589.431023] mlx5_core 0002:01:00.0: mlx5_pcie_event:322:(pid 537): PCIe slot power capability was not advertised.

[ 1589.436932] mlx5_core 0002:01:00.0: mlx5e: IPSec ESP acceleration enabled

[ 1589.559594] mlx5_core 0002:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)

[ 1589.561299] mlx5_core 0002:01:00.0 enP2p1s0f0np0: renamed from eth0

[ 1589.588129] mlx5_core 0002:01:00.1: enabling device (0000 -> 0002)

[ 1589.588477] mlx5_core 0002:01:00.1: firmware version: 28.45.4028

[ 1589.588549] mlx5_core 0002:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[ 1590.043026] mlx5_core 0002:01:00.0 enP2p1s0f0np0: Link down

[ 1590.091784] mlx5_core 0002:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps

[ 1590.092080] mlx5_core 0002:01:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)

[ 1590.093620] mlx5_core 0002:01:00.1: Flow counters bulk query buffer size increased, bulk_query_len(8)

[ 1590.107633] mlx5_core 0002:01:00.1: mlx5e: IPSec ESP acceleration enabled

[ 1590.111834] mlx5_core 0002:01:00.1: Port module event: module 1, Cable unplugged

[ 1590.112298] mlx5_core 0002:01:00.1: mlx5_pcie_event:322:(pid 23002): PCIe slot power capability was not advertised.

[ 1590.268659] mlx5_core 0002:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)

[ 1590.270100] mlx5_core 0002:01:00.1 enP2p1s0f1np1: renamed from eth0

[ 1590.743132] mlx5_core 0002:01:00.1 enP2p1s0f1np1: Link down
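A log like the one above can be reduced to one line per module event, which makes it easier to compare nodes or boots. This is a sketch that assumes the “Port module event: module N, <status>” wording shown in this log; module_events is a hypothetical helper name.

```shell
# Reads dmesg text on stdin and prints "<pci-function> module <n>: <status>"
# for every mlx5_core "Port module event" line; all other lines are dropped.
module_events() {
    sed -nE 's/.*mlx5_core ([0-9a-f:.]+): Port module event: module ([0-9]+), (.*)$/\1 module \2: \3/p'
}

# Usage:
#   sudo dmesg | module_events | sort | uniq -c
```

On the log above this would list “Cable unplugged” once for each of the four functions, which is the pattern to watch for a change after any firmware or BIOS update.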

Hi @robert287. I’m really sorry to hear about your ASUS support experience. On our end, we’re working through the issue and trying to get a reproduction.

A question from our eng team: if you leave the cable plugged in through a restart cycle, does the behavior change?

No, that does not change the behavior.

I’ve tried no cables plugged in, both sides plugged in, and only one side plugged in; port 1 to port 1, port 2 to port 2, port 1 to port 2, and port 2 to port 1. I’ve tried these with the Amphenol and NADDOD cables, and now an FS.com QSFP56 cable as well: since it was only reporting x4 lanes, I was curious whether a QSFP56 cable would solve it.

I am curious about the fact that it’s only reporting x4 lanes before any low power warnings or anything with the hotplug issue.

Also, I have a MikroTik CRS804, and the cables to connect it to the ConnectX-7 ports are due today. I’m curious whether a handshake from an active switch gives me a different response than GX10s that are apparently in a wonky state across the board. It may be a matter of timing on when the ports become active; ConnectX adapters have been known to be very aggressive in their handshake modes.

But I still find it curious that x4 lanes are reported before any power shunts, hotplug events clearing the fabric, or anything else.

Configuring the CRS804 takes some time on its own. I have four ASUS boxes: two connected via DAC, and all four connected via the switch with 400G → 2x200G breakout cables. This lets me sometimes power only two of them if the model is not that demanding. With Qwen3.6-27B I actually mostly use only one box :) My boxes were bought in two batches, so they are not all from the same batch, but overall I do not have problems. All the cables I ordered work flawlessly even though they are not officially recommended by NVIDIA.