Driving External ConnectX7 without Reset Attached

We have a setup in a server environment where our ConnectX7 PCIe adapter card is accessed over a PCIe extension mechanism. This extension uses SFF-8644 cabling without any PCIe switches in between. SFF-8644 is essentially a data-only interconnect (non-common clock; no reset line connected).

Due to restrictions in our environment, I cannot disclose specific photos. Below is the diagram of our interconnect:

This interconnect works fine from a PCIe negotiation perspective, and we are able to see the ConnectX7 on the PCIe bus (after setting disable-spread in the device table for Orin to allow for non-common clocking).

The issue we are facing is that the ConnectX7 firmware does not initialize successfully once the mlx5_core driver loads. We receive the following timeout messages:

[   15.314212] mlx5_core 0005:01:00.0: Adding to iommu group 11
[   15.326259] mlx5_core 0005:01:00.0: enabling device (0000 -> 0002)
[   15.338075] mlx5_core 0005:01:00.0: firmware version: 28.39.1002
[   15.371352] mlx5_core 0005:01:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0005:00:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[   35.398646] mlx5_core 0005:01:00.0: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 100s
[   55.414655] mlx5_core 0005:01:00.0: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 79s
[   75.434649] mlx5_core 0005:01:00.0: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 59s
[   95.456443] mlx5_core 0005:01:00.0: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 39s
[  115.476985] mlx5_core 0005:01:00.0: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 19s
[  135.397133] mlx5_core 0005:01:00.0: mlx5_function_setup:962:(pid 392): Firmware over 120000 MS in pre-initializing state, aborting
[  135.409275] mlx5_core 0005:01:00.0: init_one:1372:(pid 392): mlx5_load_one failed with error code -16
[  135.419545] mlx5_core: probe of 0005:01:00.0 failed with error -16
[  135.426717] mlx5_core 0005:01:00.1: Adding to iommu group 11
[  135.427512] mlx5_core 0005:01:00.1: enabling device (0000 -> 0002)
[  135.427786] mlx5_core 0005:01:00.1: firmware version: 28.39.1002
[  135.427850] mlx5_core 0005:01:00.1: 63.012 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x4 link at 0005:00:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[  155.432443] mlx5_core 0005:01:00.1: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 100s
[  175.444072] mlx5_core 0005:01:00.1: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 79s
[  195.461335] mlx5_core 0005:01:00.1: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 59s
[  215.475472] mlx5_core 0005:01:00.1: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 39s
[  235.494051] mlx5_core 0005:01:00.1: wait_fw_init:203:(pid 392): Waiting for FW initialization, timeout abort in 19s
[  255.416888] mlx5_core 0005:01:00.1: mlx5_function_setup:962:(pid 392): Firmware over 120000 MS in pre-initializing state, aborting
[  255.429020] mlx5_core 0005:01:00.1: init_one:1372:(pid 392): mlx5_load_one failed with error code -16
[  255.439316] mlx5_core: probe of 0005:01:00.1 failed with error -16

When we plug the adapter directly into the C5 interface, it works properly and we get the following:

[   15.396601] mlx5_core 0005:01:00.0: Adding to iommu group 11
[   15.407985] mlx5_core 0005:01:00.0: enabling device (0000 -> 0002)
[   15.418527] mlx5_core 0005:01:00.0: firmware version: 28.39.1002
[   15.418588] mlx5_core 0005:01:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0005:00:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[   15.434980] mlx5_core 0005:01:00.0: handle_hca_cap:528:(pid 380): log_max_qp value in current profile is 18, changing it to HCA capability limit (17)
[   15.800516] mlx5_core 0005:01:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
[   15.800524] mlx5_core 0005:01:00.0: E-Switch: Total vports 18, per vport: max uc(128) max mc(2048)
[   15.820062] mlx5_core 0005:01:00.1: Adding to iommu group 11
[   15.820605] mlx5_core 0005:01:00.1: enabling device (0000 -> 0002)
[   15.820888] mlx5_core 0005:01:00.1: firmware version: 28.39.1002
[   15.820982] mlx5_core 0005:01:00.1: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0005:00:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[   15.821403] mlx5_core 0005:01:00.0: Port module event: module 0, Cable unplugged
[   15.821615] mlx5_core 0005:01:00.0: mlx5_pcie_event:289:(pid 7): Detected insufficient power on the PCIe slot (27W).
[   16.165004] mlx5_core 0005:01:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
[   16.165018] mlx5_core 0005:01:00.1: E-Switch: Total vports 18, per vport: max uc(128) max mc(2048)
[   16.183577] mlx5_core 0005:01:00.1: Port module event: module 1, Cable unplugged
[   16.183778] mlx5_core 0005:01:00.1: mlx5_pcie_event:289:(pid 152): Detected insufficient power on the PCIe slot (27W).
[   16.184013] mlx5_core 0005:01:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   16.316866] mlx5_core 0005:01:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[   16.321297] mlx5_core 0005:01:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   16.438954] mlx5_core 0005:01:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
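
One difference visible in the two logs is the negotiated link: x4 at 16.0 GT/s over the extension versus x8 at 16.0 GT/s when plugged directly into C5. This may or may not be related to the firmware timeout. A minimal Python sketch for pulling that value out of saved dmesg text, so the two cases can be compared side by side, is shown here; the helper name `link_summary` is hypothetical, and the pattern assumes the mlx5_core bandwidth line has the same wording as in the logs above.

```python
import re

# Matches the mlx5_core bandwidth line as it appears in the logs above, e.g.
# "mlx5_core 0005:01:00.0: ... limited by 16.0 GT/s PCIe x4 link at ..."
LINK_RE = re.compile(
    r"mlx5_core (?P<bdf>[0-9a-f:.]+): .*limited by "
    r"(?P<speed>[\d.]+) GT/s PCIe x(?P<width>\d+) link"
)


def link_summary(dmesg_text):
    """Return {bdf: (speed_in_GT_per_s, lane_width)} for each function found."""
    result = {}
    for line in dmesg_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            result[match.group("bdf")] = (
                float(match.group("speed")),
                int(match.group("width")),
            )
    return result
```

Feeding it the output of `dmesg` from each boot gives one entry per PCI function, which makes it easy to spot a width or speed change between the extension and the direct connection.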

The last variable we have seen: if we hold off powering up the ConnectX7 until after the UEFI bootloader completes (powering it on only when we see the first log output from the Linux kernel, but before the pcieport driver is set up), everything works fine in our setup and the firmware initializes.

This has led us to believe that, since reset and power are separated between the two entities (CX7 and Orin), there must be some issue with the CX7 being re-initialized without a hard reset applied, i.e. it negotiates PCIe at the UEFI layer but gets confused when it is initialized again at the firmware layer.

We have tried a PCI bus rescan and a reset of the CX7 through /sys, and the same firmware issue shows up (we assume because a full reset is required, not just a PCIe reset).
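
For reference, a sysfs reset/remove/rescan attempt of this kind could look roughly like the Python sketch below. It is not the exact sequence we ran. It assumes the standard Linux PCI sysfs attributes (`reset` and `remove` under the device directory, and `/sys/bus/pci/rescan`); the function name and the `sysfs_root` parameter are hypothetical and only there so the logic can be tried against a scratch directory. It must be run as root on a real system.

```python
from pathlib import Path


def remove_and_rescan(bdf, sysfs_root="/sys", do_reset=False):
    """Optionally reset a PCI function, then remove it and rescan the bus.

    bdf is the full address, e.g. "0005:01:00.0".
    Returns the list of sysfs paths that were written, in order.
    """
    pci = Path(sysfs_root) / "bus" / "pci"
    dev = pci / "devices" / bdf
    steps = []
    if do_reset:
        # Kernel-selected reset method (for example FLR or a bus reset).
        # As far as we can tell this does not toggle the card's PERST# pin.
        steps.append(dev / "reset")
    steps.append(dev / "remove")  # detach the function from the PCI tree
    steps.append(pci / "rescan")  # ask the kernel to re-enumerate all buses
    for path in steps:
        path.write_text("1")
    return [str(p) for p in steps]
```

Calling `remove_and_rescan("0005:01:00.0", do_reset=True)` would cover one function; a dual-port card exposes a second function (`0005:01:00.1` in the logs above) that would need the same treatment.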

Again, this works fine with an x86 node. Our goal was to take this existing infrastructure and vet its use with Orin AGX.

Any advice would help. We have reached out to the Mellanox team but they advised posting to the Jetson forums in parallel.

Sorry for the late response.
Is this still an issue that needs support? Are there any results you can share?