100 GbE on a Jetson AGX Orin dev kit


What is our best chance to use a 100 GbE NIC (with DPDK) in a Jetson AGX Orin dev kit? So far, we tried:

  • an NVIDIA MCX653105A-ECAT: not detected with lspci after booting (not even after echo 1 >/sys/bus/pci/rescan).
  • an Intel E810: works fine with the ice driver, but not with DPDK, because the vfio_pci driver complains that IOMMU group 12 is not viable.
  • a QNAP QXG-25G2SF-CX6: works (also with DPDK), but needs multiple boot attempts to be detected, and is only dual 25 GbE.

We use JetPack 5.0.2 with the Jetson Linux 35.1 BSP. The solutions described in "SOLUTION/TUTORIAL: Jetson ORIN Enabling PCIE power" and "Jetson AGX Orin board not able to enumerate FPGA PCIe card (Latest kernel 35.1.0)" did not solve our problems. The dmesg output is attached. Any help is more than welcome!

dmesg (78.5 KB)
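
A note on the "IOMMU group 12 is not viable" message: VFIO generally refuses a group unless every device in it is either unbound or bound to a VFIO-compatible driver, so it helps to see what else shares the group with the NIC. The following is a minimal sketch that reads the standard sysfs layout under /sys/kernel/iommu_groups. The function names, the `sysfs_root` parameter, and the set of "acceptable" drivers are assumptions made for illustration, not something taken from this thread or from DPDK.

```python
from pathlib import Path

# Assumption: drivers that do not block VFIO from claiming the group.
# Unbound devices (no 'driver' symlink) are treated as acceptable too.
VFIO_OK_DRIVERS = {"vfio-pci", "pci-stub"}


def group_members(group, sysfs_root="/sys"):
    """Return {device_name: driver_name_or_None} for one IOMMU group."""
    devices_dir = Path(sysfs_root) / "kernel" / "iommu_groups" / str(group) / "devices"
    members = {}
    for dev in sorted(devices_dir.iterdir()):
        driver_link = dev / "driver"
        # 'driver' is a symlink to the bound driver; it is absent when unbound.
        members[dev.name] = driver_link.resolve().name if driver_link.exists() else None
    return members


def blocking_devices(members):
    """Devices whose current driver would keep the group from being viable."""
    return {
        name: driver
        for name, driver in members.items()
        if driver is not None and driver not in VFIO_OK_DRIVERS
    }


# Example use on the target (group number taken from the error message):
#   members = group_members(12)
#   print(blocking_devices(members))
```

If the result lists devices other than the NIC's own functions (on some platforms the group can also contain non-PCI devices), those would have to be unbound or rebound before vfio_pci accepts the group. Whether that is feasible on the Orin is not something this sketch can answer.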

Sorry for the late response. Is this still an issue that needs support? Thanks.

Yes, we still have not found a way to make any 100 GbE NIC work in the AGX Orin: either the NIC is not detected at boot time, or the vfio_pci driver used by DPDK does not work (we need DPDK to avoid prohibitive OS overhead). We use the Orin's tensor cores for signal processing, and they are so fast that we think the Orin can handle close to 100 Gb/s of input in less than 75 W (including the NIC), which would set a new milestone in energy efficiency. But we need a working 100 GbE NIC for that.

Hey John, I was in a very similar situation. My ConnectX-5 Ex was not recognized by the Jetson AGX Orin. The solution was to update the ConnectX FW, PXE, and UEFI. After the update, the card is recognized by the Jetson.

Initializing...
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX5
  Part Number:      MCX516A-CDA_Ax_Bx
  Description:      ConnectX-5 Ex EN network interface card; 100GbE dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000013
  PCI Device Name:  03:00.0
  Base GUID:        1070fd0300982146
  Base MAC:         1070fd982146
  Versions:         Current        Available
     FW             16.32.1010     16.35.2000
     PXE            3.6.0502       3.6.0805
     UEFI           14.25.0017     14.28.0016

  Status:           Update required

---------
Found 1 device(s) requiring firmware update...

Device #1: Updating FW ...
FSMST_INITIALIZE -   OK
Writing Boot image component -   OK
Done

lspci output on the Jetson after the update:
0005:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
        Subsystem: Mellanox Technologies ConnectX-5 Ex EN network interface card, 100GbE dual-port QSFP28, PCIe4.0 x16, tall bracket; MCX516A-CDAT
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 55
        Region 0: Memory at 2742000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at 2b28100000 [disabled] [size=1M]
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (ok), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABC, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [48] Vital Product Data
                Product Name: CX516A - ConnectX-5 QSFP28
                Read-only fields:
                        [PN] Part number: MCX516A-CDAT
                        [EC] Engineering changes: B7
                        [V2] Vendor specific: MCX516A-CDAT
                        [SN] Serial number: MT2215J15318
                        [V3] Vendor specific: 7e1cc93ad1b5ec1180001070fd982146
                        [VA] Vendor specific: MLX:MODL=CX516A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
                        [V0] Vendor specific: PCIeGen4 x16
                        [VU] Vendor specific: MT2215J15318MLNXS0D0F1
                        [RV] Reserved: checksum good, 2 byte(s) reserved
                End
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 08, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [230 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [420 v1] Data Link Feature <?>
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
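
One detail in the lspci output above is worth a quick calculation: LnkSta reports "Speed 16GT/s (ok), Width x8 (downgraded)", so the x16 card has trained at x8. The sketch below parses that line and computes the raw link bandwidth after line coding (128b/130b for 8 GT/s and above). The function names are made up for illustration, and the result is an upper bound that ignores TLP/DLLP protocol overhead.

```python
import re

# Line-code efficiency by per-lane signalling rate (GT/s):
# 8b/10b for 2.5 and 5 GT/s, 128b/130b for 8 GT/s and above.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130, 16.0: 128 / 130, 32.0: 128 / 130}


def parse_lnksta(line):
    """Extract (speed in GT/s, lane count) from an lspci 'LnkSta:' line."""
    match = re.search(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)", line)
    if match is None:
        raise ValueError("not a LnkSta line: %r" % line)
    return float(match.group(1)), int(match.group(2))


def raw_gbps(speed_gts, width):
    """Raw one-direction link bandwidth in Gb/s, before packet overhead."""
    return speed_gts * width * ENCODING[speed_gts]


speed, width = parse_lnksta("LnkSta: Speed 16GT/s (ok), Width x8 (downgraded)")
print(f"{speed:g} GT/s x{width}: {raw_gbps(speed, width):.1f} Gb/s raw")
```

For 16 GT/s x8 this gives roughly 126 Gb/s per direction, so a single 100 GbE port should fit in principle, while both ports of a dual-port card could not run at line rate simultaneously over this link.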

Great; I will give it a try. I assume that you did the update on another machine that did recognize the card?

Yes, the update was made on an x86 desktop. The card was installed in a PCIe 3.0 x8 slot (physically x16); probably not important, but worth mentioning.

Updating the firmware solved the problem with our ConnectX-6 NIC too. The NIC originated from a Lenovo server, and it turned out that Lenovo had installed its own firmware. The NIC is now properly detected by the Orin. Thanks for the tip; it was really helpful!

./mlxfwmanager_LeSI_22C_OFED-5.8-0_buid1
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX6
  Part Number:      SC57A40943_Ax
  Description:      ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
  PSID:             LNV0000000016
  PCI Device Name:  0000:01:00.0
  Base GUID:        043f720300d42ff0
  Base MAC:         043f72d42ff0
  Versions:         Current        Available
     FW             20.28.1002     20.35.1012
     PXE            3.6.0101       3.6.0804
     UEFI           14.21.0016     14.28.0015

  Status:           Update required
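
For anyone checking several cards, the "Versions" table in the mlxfwmanager query output above can be compared mechanically. This is a rough sketch that assumes the text layout shown in this thread (a "Versions:" header followed by rows of name, current, available); that layout is not a documented machine-readable format, and the helper names are hypothetical.

```python
import re

VERSION_RE = re.compile(r"[\d.]+")


def parse_versions(text):
    """Return {component: (current, available)} from a pasted query output."""
    rows = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:1] == ["Versions:"]:
            in_table = True
            continue
        if in_table:
            # A table row is: name, current version, available version.
            if len(fields) == 3 and all(VERSION_RE.fullmatch(f) for f in fields[1:]):
                rows[fields[0]] = (fields[1], fields[2])
            else:
                in_table = False  # blank line or next section ends the table
    return rows


def version_tuple(version):
    """'20.28.1002' -> (20, 28, 1002), so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def outdated(rows):
    """Names of components whose current version is older than the available one."""
    return [
        name
        for name, (current, available) in rows.items()
        if version_tuple(current) < version_tuple(available)
    ]
```

Fed the ConnectX-6 output above, `outdated(parse_versions(text))` would list FW, PXE, and UEFI, matching the "Update required" status.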