Hardware issue

I am having GPU issues when trying to use two Sparks in cluster mode with Ray + vLLM.

Problem statement

GPU PCIe link does not train correctly.
Root port 000f:00:00.0 reports LnkSta: Speed unknown, Width x0.
GPU 000f:01:00.0 is stuck at PCIe Gen1 x1.
Kernel logs report 0.000 Gb/s available PCIe bandwidth and repeated DOE mailbox timeouts.
Issue persists across reboots, kernel parameters, and OS configuration changes.
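For anyone who wants to check the same link state without parsing lspci output, here is a minimal Python sketch. It assumes the kernel exposes the standard PCI sysfs attributes current_link_speed, current_link_width, max_link_speed and max_link_width under /sys/bus/pci/devices/<address>. The helper name read_link_state and its sysfs_root parameter are made up for illustration and do not come from any NVIDIA tool.

```python
from pathlib import Path

# Standard PCI sysfs link attributes (assumed to be exposed for these devices).
LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_state(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCIe link attributes of one device as a dict.

    Attributes that are missing or unreadable are reported as None.
    """
    device_dir = Path(sysfs_root) / address
    state = {}
    for attr in LINK_ATTRS:
        try:
            state[attr] = (device_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state


if __name__ == "__main__":
    # Root port and GPU endpoint addresses taken from the logs in this post.
    for addr in ("000f:00:00.0", "000f:01:00.0"):
        print(addr, read_link_state(addr))
```

Comparing the current and max values for the root port and the endpoint side by side should give the same picture as the LnkCap/LnkSta lines further down.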

I suspect hardware failure.

I updated the system with the latest OS and BIOS updates, but the issue persists.

Any thoughts?

uname -a
Linux koula 6.14.0-1015-nvidia #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 25 18:02:16 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
NVIDIA_DGX_Spark A.7 P4242 A04

Vendor: American Megatrends International, LLC.

Version: 5.36_0ACUM018

Release Date: 08/06/2025

From dmesg:



[    0.097574] pci 0009:00:00.0: Max Payload Size set to  512/ 512 (was  128), Max Read Rq  512

[    0.097594] pci 0009:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  256

[    0.102037] platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]

[    0.102050] acpi NVDA8800:00: platform device creation failed: -16

[    0.102126] platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]

[    0.102132] acpi NVDA8900:00: platform device creation failed: -16

[    0.102927] ACPI: PCI Root Bridge [PCIF] (domain 000f [bus 00-01])

[    0.102938] acpi PNP0A08:0b: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]

[    0.103011] acpi PNP0A08:0b: _OSC: platform does not support [SHPCHotplug DPC]

[    0.103105] acpi PNP0A08:0b: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]

[    0.103855] acpi PNP0A08:0b: ECAM area [mem 0x29000000-0x291fffff] reserved by PNP0C02:01

[    0.103873] acpi PNP0A08:0b: ECAM at [mem 0x29000000-0x291fffff] for [bus 00-01]

[    0.104003] PCI host bridge to bus 000f:00

[    0.104032] pci_bus 000f:00: root bus resource [mem 0x24000000-0x281fffff window]

[    0.104037] pci_bus 000f:00: root bus resource [bus 00-01]

[    0.104071] pci 000f:00:00.0: [10de:22d1] type 01 class 0x060400 PCIe Root Port

[    0.104108] pci 000f:00:00.0: PCI bridge to [bus 01]

[    0.104134] pci 000f:00:00.0:   bridge window [mem 0x24000000-0x27ffffff 64bit pref]

[    0.104291] pci 000f:00:00.0: PME# supported from D0 D3hot

[    0.104824] pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint

[    0.104894] pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]

[    0.104964] pci 000f:01:00.0: Enabling HDA controller

[    1.110847] pci 000f:01:00.0: DOE: [2c8] ABORT timed out

[    1.110854] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5

[    1.110866] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

[    1.110905] pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)

[    1.111279] pci 000f:00:00.0: PCI bridge to [bus 01]

[    1.111304] pci 000f:00:00.0: PCI bridge to [bus 01]

sudo dmidecode -t bios

# dmidecode 3.5

Getting SMBIOS data from sysfs.

SMBIOS 3.3.0 present.



Handle 0x0000, DMI type 0, 26 bytes

BIOS Information

Vendor: American Megatrends International, LLC.

Version: 5.36_0ACUM018

Release Date: 08/06/2025

Address: 0xF0000

Runtime Size: 64 kB

ROM Size: 32 MB

Characteristics:

PCI is supported

BIOS is upgradeable

BIOS shadowing is allowed

Boot from CD is supported

Selectable boot is supported

BIOS ROM is socketed

ACPI is supported

BIOS boot specification is supported

Targeted content distribution is supported

UEFI is supported

BIOS Revision: 5.36

Handle 0x0006, DMI type 13, 22 bytes

BIOS Language Information

Language Description Format: Long

Installable Languages: 1

en|US|iso8859-1

Currently Installed Language: en|US|iso8859-1

sudo lspci -s 000f:00:00.0 -vv

000f:00:00.0 PCI bridge: NVIDIA Corporation Device 22d1 (prog-if 00 [Normal decode])

Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

Latency: 0

Interrupt: pin ? routed to IRQ 343

IOMMU group: 5

Bus: primary=00, secondary=01, subordinate=01, sec-latency=0

I/O behind bridge: [disabled] [32-bit]

Memory behind bridge: [disabled] [32-bit]

Prefetchable memory behind bridge: 24000000-27ffffff [size=64M] [32-bit]

Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-

BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-

PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-

Capabilities: [40] Power Management version 3

Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)

Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Capabilities: [48] Express (v2) Root Port (Slot-), MSI 00

DevCap: MaxPayload 512 bytes, PhantFunc 0

ExtTag+ RBE+

DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+

RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+

MaxPayload 512 bytes, MaxReadReq 512 bytes

DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-

LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <32us

ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+

LnkCtl: ASPM Disabled; RCB 128 bytes, Disabled- CommClk-

ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+

LnkSta: Speed unknown, Width x0

TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-

RootCap: CRSVisible+

RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+

RootSta: PME ReqID 0000, PMEStatus- PMEPending-

DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+

10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1

EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-

FRS+ LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd+

AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+

DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, ARIFwd+

AtomicOpsCtl: ReqEn- EgressBlck-

LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS+

LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-

Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot

LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-

EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-

Retimer- 2Retimers- CrosslinkRes: Downstream Port, DRS-

DownstreamComp: Link Up - Present

Capabilities: [84] MSI: Enable+ Count=4/4 Maskable+ 64bit+

Address: 0000000006850040  Data: 0000

Masking: 0000000a  Pending: 00000000

Capabilities: [100 v1] Secondary PCI Express

LnkCtl3: LnkEquIntrruptEn- PerformEqu-

LaneErrStat: 0

Capabilities: [12c v1] Data Link Feature <?>

Capabilities: [138 v1] Physical Layer 16.0 GT/s <?>

Capabilities: [168 v1] Extended Capability ID 0x2a

Capabilities: [198 v1] Lane Margining at the Receiver <?>

Capabilities: [1f4 v2] Advanced Error Reporting

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-

CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+

AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-

MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap+

HeaderLog: 00000000 00000000 00000000 00000000

RootCmd: CERptEn+ NFERptEn+ FERptEn+

RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-

FirstFatal- NonFatalMsg- FatalMsg- IntMsg 2

ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000

Capabilities: [23c v1] Access Control Services

ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Capabilities: [244 v1] FRS Queueing <?>

Capabilities: [298 v1] Hierarchy ID <?>

Capabilities: [2b8 v1] Extended Capability ID 0x30

Kernel driver in use: pcieport


sudo lspci -s 000f:01:00.0 -vv

000f:01:00.0 VGA compatible controller: NVIDIA Corporation Device 2e12 (rev a1) (prog-if 00 [VGA controller])

Subsystem: NVIDIA Corporation Device 0000

Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

Latency: 0

Interrupt: pin A routed to IRQ 481

IOMMU group: 20

Region 0: Memory at 24000000 (64-bit, prefetchable) [size=64M]

Capabilities: [40] Power Management version 3

Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+

Address: 0000000000000000  Data: 0000

Masking: 00000000  Pending: 00000000

Capabilities: [60] Express (v2) Endpoint, MSI 00

DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us

ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W

DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+

RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-

MaxPayload 256 bytes, MaxReadReq 256 bytes

DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-

LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L1, Exit Latency L1 <4us

ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+

LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-

ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

LnkSta: Speed 2.5GT/s, Width x1 (downgraded)

TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+

10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix+, MaxEETLPPrefixes 1

EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-

FRS- TPHComp- ExtTPHComp-

AtomicOpsCap: 32bit- 64bit- 128bitCAS-

DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,

AtomicOpsCtl: ReqEn+

LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer+ 2Retimers+ DRS-

LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-

Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot

LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-

EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-

Retimer- 2Retimers- CrosslinkRes: unsupported

Capabilities: [9c] Vendor Specific Information: Len=14 <?>

Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-

Vector table: BAR=0 offset=00b90000

PBA: BAR=0 offset=00ba0000

Capabilities: [100 v1] Secondary PCI Express

LnkCtl3: LnkEquIntrruptEn- PerformEqu-

LaneErrStat: 0

Capabilities: [12c v1] Latency Tolerance Reporting

Max snoop latency: 0ns

Max no snoop latency: 0ns

Capabilities: [14c v1] Data Link Feature <?>

Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>

Capabilities: [188 v1] Extended Capability ID 0x2a

Capabilities: [1b8 v2] Advanced Error Reporting

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-

CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+

AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-

MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-

HeaderLog: 00000000 00000000 00000000 00000000

Capabilities: [200 v1] Lane Margining at the Receiver <?>

Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)

ARICap: MFVC- ACS-, Next Function: 0

ARICtl: MFVC- ACS-, Function Group: 0

Capabilities: [290 v2] L1 PM Substates

L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+

  PortCommonModeRestoreTime=0us PortTPowerOnTime=10us

L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-

  T_CommonMode=0us LTR1.2_Threshold=0ns

L1SubCtl2: T_PwrOn=10us

Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>

Capabilities: [2c8 v1] Data Object Exchange

DOECap: IntSup+

Interrupt Message Number 008

DOECtl: IntEn-

DOESta: Busy+ IntSta+ Error+ ObjectReady-

Capabilities: [2e0 v1] Address Translation Service (ATS)

ATSCap: Invalidate Queue Depth: 00

ATSCtl: Enable+, Smallest Translation Unit: 00

Capabilities: [2e8 v1] Process Address Space ID (PASID)

PASIDCap: Exec- Priv-, Max PASID Width: 14

PASIDCtl: Enable+ Exec- Priv-

Capabilities: [2f0 v1] Device Serial Number 00-00-00-00-00-2d-b0-48

Kernel driver in use: nvidia

Kernel modules: nvidiafb, nvidia_drm, nvidia


sudo dmesg | grep -i '000f:01:00.0\|pcie bandwidth\|DOE'

[    0.071805] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug DPC]

[    0.079257] acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug DPC]

[    0.086694] acpi PNP0A08:02: _OSC: platform does not support [SHPCHotplug DPC]

[    0.089678] acpi PNP0A08:03: _OSC: platform does not support [SHPCHotplug DPC]

[    0.090378] acpi PNP0A08:04: _OSC: platform does not support [SHPCHotplug DPC]

[    0.093715] acpi PNP0A08:05: _OSC: platform does not support [SHPCHotplug DPC]

[    0.094397] acpi PNP0A08:06: _OSC: platform does not support [SHPCHotplug DPC]

[    0.103011] acpi PNP0A08:0b: _OSC: platform does not support [SHPCHotplug DPC]

[    0.104824] pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint

[    0.104894] pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]

[    0.104964] pci 000f:01:00.0: Enabling HDA controller

[    1.110847] pci 000f:01:00.0: DOE: [2c8] ABORT timed out

[    1.110854] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5

[    1.110866] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

[    1.110905] pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)

[    1.111361] pci 000f:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  256

[    1.114872] pci 000f:01:00.0: vgaarb: setting as boot VGA device

[    1.114876] pci 000f:01:00.0: vgaarb: bridge control possible

[    1.114879] pci 000f:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none

[    2.117415] mlx5_core 0000:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[    2.608958] mlx5_core 0000:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[    3.085515] mlx5_core 0002:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[    3.570574] mlx5_core 0002:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)

[    6.079621] nvidia 000f:01:00.0: Adding to iommu group 20

[    6.108638] nvidia 000f:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none

[   11.160658] [drm] Initialized nvidia-drm 0.0.0 for 000f:01:00.0 on minor 1

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Tue Dec 30 00:13:14 2025

Driver Version                            : 580.95.05

CUDA Version                              : 13.0

Attached GPUs                             : 1

GPU 0000000F:01:00.0

    Product Name                          : NVIDIA GB10

    Product Brand                         : NVIDIA RTX

    Product Architecture                  : Blackwell

    Display Mode                          : Requested functionality has been deprecated

    Display Attached                      : Yes

    Display Active                        : Disabled

    Persistence Mode                      : Enabled

    Addressing Mode                       : ATS

    MIG Mode

        Current                           : N/A

        Pending                           : N/A

    Accounting Mode                       : Disabled

    Accounting Mode Buffer Size           : 4000

    Driver Model

        Current                           : N/A

        Pending                           : N/A

    Serial Number                         : N/A

    GPU UUID                              : GPU-380f91b2-3260-8323-ee1c-3d5243189b61

    GPU PDI                               : 0x2043235a78513c47

    Minor Number                          : 0

    VBIOS Version                         : 9A.0B.0F.00.1D

    MultiGPU Board                        : No

    Board ID                              : 0xf0100

    Board Part Number                     : N/A

    GPU Part Number                       : 2E12-275-A1

    FRU Part Number                       : N/A

    Platform Info

        Chassis Serial Number             : 

        Slot Number                       : 0

        Tray Index                        : 0

        Host ID                           : 1

        Peer Type                         : Direct Connected

        Module Id                         : 1

        GPU Fabric GUID                   : 0x0000000000000000

    Inforom Version

        Image Version                     : N/A

        OEM Object                        : N/A

        ECC Object                        : N/A

        Power Management Object           : N/A

    Inforom BBX Object Flush

        Latest Timestamp                  : N/A

        Latest Duration                   : N/A

    GPU Operation Mode

        Current                           : N/A

        Pending                           : N/A

    GPU C2C Mode                          : Enabled

    GPU Virtualization Mode

        Virtualization Mode               : None

        Host VGPU Mode                    : N/A

        vGPU Heterogeneous Mode           : N/A

    GPU Recovery Action                   : None

    GSP Firmware Version                  : 580.95.05

    IBMNPU

        Relaxed Ordering Mode             : N/A

    PCI

        Bus                               : 0x01

        Device                            : 0x00

        Domain                            : 0x000F

        Base Classcode                    : 0x3

        Sub Classcode                     : 0x0

        Device Id                         : 0x2E1210DE

        Bus Id                            : 0000000F:01:00.0

        Sub System Id                     : 0x000010DE

        GPU Link Info

            PCIe Generation

                Max                       : 1

                Current                   : 1

                Device Current            : 1

                Device Max                : 5

                Host Max                  : 5

            Link Width

                Max                       : 16x

                Current                   : 1x

        Bridge Chip

            Type                          : N/A

            Firmware                      : N/A

        Replays Since Reset               : 0

        Replay Number Rollovers           : 0

        Tx Throughput                     : N/A

        Rx Throughput                     : N/A

        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 

        Atomic Caps Inbound               : N/A

    Fan Speed                             : N/A

    Performance State                     : P8

    Clocks Event Reasons

        Idle                              : Not Active

        Applications Clocks Setting       : Not Active

        SW Power Cap                      : Active

        HW Slowdown                       : Not Active

            HW Thermal Slowdown           : Not Active

            HW Power Brake Slowdown       : Not Active

        Sync Boost                        : Not Active

        SW Thermal Slowdown               : Not Active

        Display Clock Setting             : Not Active

    Clocks Event Reasons Counters

        SW Power Capping                  : 1179038295 us

        Sync Boost                        : 0 us

        SW Thermal Slowdown               : 0 us

        HW Thermal Slowdown               : 0 us

        HW Power Braking                  : 0 us

    Sparse Operation Mode                 : N/A

    FB Memory Usage

        Total                             : N/A

        Reserved                          : N/A

        Used                              : N/A

        Free                              : N/A

    BAR1 Memory Usage

        Total                             : N/A

        Used                              : N/A

        Free                              : N/A

    Conf Compute Protected Memory Usage

        Total                             : 0 MiB

        Used                              : 0 MiB

        Free                              : 0 MiB

    Compute Mode                          : Default

    Utilization

        GPU                               : 0 %

        Memory                            : 0 %

        Encoder                           : 0 %

        Decoder                           : 0 %

        JPEG                              : 0 %

        OFA                               : 0 %

    Encoder Stats

        Active Sessions                   : 0

        Average FPS                       : 0

        Average Latency                   : 0

    FBC Stats

        Active Sessions                   : 0

        Average FPS                       : 0

        Average Latency                   : 0

    DRAM Encryption Mode

        Current                           : N/A

        Pending                           : N/A

    ECC Mode

        Current                           : N/A

        Pending                           : N/A

    ECC Errors

        Volatile

            SRAM Correctable              : N/A

            SRAM Uncorrectable Parity     : N/A

            SRAM Uncorrectable SEC-DED    : N/A

            DRAM Correctable              : N/A

            DRAM Uncorrectable            : N/A

        Aggregate

            SRAM Correctable              : N/A

            SRAM Uncorrectable Parity     : N/A

            SRAM Uncorrectable SEC-DED    : N/A

            DRAM Correctable              : N/A

            DRAM Uncorrectable            : N/A

            SRAM Threshold Exceeded       : N/A

        Aggregate Uncorrectable SRAM Sources

            SRAM L2                       : N/A

            SRAM SM                       : N/A

            SRAM Microcontroller          : N/A

            SRAM PCIE                     : N/A

            SRAM Other                    : N/A

        Channel Repair Pending            : N/A

        TPC Repair Pending                : N/A

    Retired Pages

        Single Bit ECC                    : N/A

        Double Bit ECC                    : N/A

        Pending Page Blacklist            : N/A

    Remapped Rows                         : N/A

    Temperature

        GPU Current Temp                  : 42 C

        GPU T.Limit Temp                  : 53 C

        GPU Shutdown T.Limit Temp         : -5 C

        GPU Slowdown T.Limit Temp         : -2 C

        GPU Max Operating T.Limit Temp    : 0 C

        GPU Target Temperature            : N/A

        Memory Current Temp               : N/A

        Memory Max Operating T.Limit Temp : N/A

    GPU Power Readings

        Average Power Draw                : 4.74 W

        Instantaneous Power Draw          : 4.74 W

        Current Power Limit               : N/A

        Requested Power Limit             : N/A

        Default Power Limit               : N/A

        Min Power Limit                   : N/A

        Max Power Limit                   : N/A

    GPU Memory Power Readings 

        Average Power Draw                : N/A

        Instantaneous Power Draw          : N/A

    Module Power Readings

        Average Power Draw                : N/A

        Instantaneous Power Draw          : N/A

        Current Power Limit               : N/A

        Requested Power Limit             : N/A

        Default Power Limit               : N/A

        Min Power Limit                   : N/A

        Max Power Limit                   : N/A

    Power Smoothing                       : N/A

    Workload Power Profiles

        Requested Profiles                : N/A

        Enforced Profiles                 : N/A

    Clocks

        Graphics                          : 208 MHz

        SM                                : 208 MHz

        Memory                            : N/A

        Video                             : 598 MHz

    Applications Clocks

        Graphics                          : 2418 MHz

        Memory                            : N/A

    Default Applications Clocks

        Graphics                          : 2418 MHz

        Memory                            : N/A

    Deferred Clocks

        Memory                            : N/A

    Max Clocks

        Graphics                          : 3003 MHz

        SM                                : 3003 MHz

        Memory                            : N/A

        Video                             : 3003 MHz

    Max Customer Boost Clocks

        Graphics                          : N/A

    Clock Policy

        Auto Boost                        : N/A

        Auto Boost Default                : N/A

    Fabric

        State                             : N/A

        Status                            : N/A

        CliqueId                          : N/A

        ClusterUUID                       : N/A

        Health

            Summary                       : N/A

            Bandwidth                     : N/A

            Route Recovery in progress    : N/A

            Route Unhealthy               : N/A

            Access Timeout Recovery       : N/A

            Incorrect Configuration       : N/A

    Processes                             : None

    Capabilities

        EGM                               : disabled


nvidia-smi -q | egrep -i 'VBIOS Version|GSP Firmware Version|Driver Version|CUDA Version'

Driver Version                            : 580.95.05

CUDA Version                              : 13.0

    VBIOS Version                         : 9A.0B.0F.00.1D

    GSP Firmware Version                  : 580.95.05

This explains it: Non-functional PCIe width link - #4 by eugr

@elsaco thank you for the post above.

Unfortunately, I think I have a different issue.

I understand that DGX Spark does not support GPUDirect RDMA by design. However, the issue is independent of RDMA: the GPU PCIe root port does not train (x0), and the GPU is limited to Gen1 x1 with DOE timeouts. This occurs before any GPUDirect functionality is involved.

The first “error”, from the Root Port at 000f:00:00.0, means that the downstream device (the GPU in this case) is not reporting the expected information, hence the "Speed unknown" link status.

Sample output on my device:

000f:00:00.0 PCI bridge: NVIDIA Corporation Device 22d1
---cut---
     LnkSta: Speed unknown, Width x0
     TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
---cut---

Notice there are no training errors (TrErr-) and that link training is over (Train-).

Sample output for the GPU behind the bridge:

000f:01:00.0 VGA compatible controller: NVIDIA Corporation Device 2e12 (rev a1) (prog-if 00 [VGA controller])
---cut---
     LnkSta: Speed 2.5GT/s, Width x1 (downgraded)
     TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
---cut---

The flags TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- show a healthy link with no errors.
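If you want to compare these flags across machines without reading them by eye, a small parser helps. This is only a sketch: it assumes the lspci -vv text format shown in this thread (a "LnkSta:" line followed by a line of Name+ / Name- flags), and the function names parse_link_flags and parse_lnksta are my own, not part of pciutils.

```python
import re


def parse_link_flags(flag_line):
    """Turn 'TrErr- Train- SlotClk+' into {'TrErr': False, ..., 'SlotClk': True}."""
    return {
        name: sign == "+"
        for name, sign in re.findall(r"(\w+)([+-])", flag_line)
    }


def parse_lnksta(lnksta_line):
    """Extract speed and width from an lspci 'LnkSta:' line, or None if absent."""
    match = re.search(r"LnkSta:\s*Speed ([^,]+), Width x(\d+)", lnksta_line)
    if match is None:
        return None
    return {"speed": match.group(1).strip(), "width": int(match.group(2))}


if __name__ == "__main__":
    # Lines copied from the GPU endpoint output quoted above.
    print(parse_lnksta("LnkSta: Speed 2.5GT/s, Width x1 (downgraded)"))
    print(parse_link_flags("TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-"))
```

Running both functions on the root port and endpoint output from two Sparks makes it easy to diff the results and see whether one unit really differs from the other.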

As @eugr explained in the other post, the Spark interconnects differently from other PCIe systems, e.g. a desktop motherboard.

What I don’t see on my Sparks are the DOE messages you mentioned above. The only entries noticed are:

elsaco@spark2:~$ sudo dmesg | grep -i 'mailbox'
[sudo] password for elsaco:
[    1.112705] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
[    1.112716] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

DGX Spark GPU is a part of the SOC, and GPU side is connected to CPU side via C2C NVLink, not PCIe.

Thank you both, @elsaco and @eugr, for the responses.

I am a bit rusty with the Linux kernel and reverse engineering. As there are no official topology diagrams, I am trying to figure out what is happening based on how the kernel represents each piece of hardware and its connectivity.

@eugr, you mentioned that “DGX Spark GPU is a part of the SOC, and GPU side is connected to CPU side via C2C NVLink, not PCIe”. However, I need the right terminology to describe the issue.

As far as I can tell, the SoC/CPU topology should look like the below.

That comes simply from the output below.

dmesg

...
[    1.119940] pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
...

sudo lspci -tvnn
-[0000:00]---00.0-[01-0f]--+-00.0  Mellanox Technologies MT2910 Family [ConnectX-7] [15b3:1021]
                           \-00.1  Mellanox Technologies MT2910 Family [ConnectX-7] [15b3:1021]
-[0002:00]---00.0-[01-0f]--+-00.0  Mellanox Technologies MT2910 Family [ConnectX-7] [15b3:1021]
                           \-00.1  Mellanox Technologies MT2910 Family [ConnectX-7] [15b3:1021]
-[0004:00]---00.0-[01-0f]----00.0  Samsung Electronics Co Ltd Device [144d:a810]
-[0007:00]---00.0-[01-0f]----00.0  Realtek Semiconductor Co., Ltd. Device [10ec:8127]
-[0009:00]---00.0-[01-0f]----00.0  MEDIATEK Corp. Device [14c3:7925]
-[000f:00]---00.0-[01]----00.0  NVIDIA Corporation Device [10de:2e12]
sudo lspci -s 000f:01:00.0 -nn
000f:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2e12] (rev a1)
sudo lspci -s 000f:00:00.0 -nn
000f:00:00.0 PCI bridge [0604]: NVIDIA Corporation Device [10de:22d1]

Issue Summary

GPU PCIe link state appears inconsistent / degraded:

  • Upstream root port: 000f:00:00.0 (NVIDIA 22d1) reports:
    • LnkCap: Speed 32GT/s, Width x16
    • LnkSta: Speed unknown, Width x0 (DLActive shows “+” on root port view)
  • GPU endpoint: 000f:01:00.0 (NVIDIA 2e12, rev a1) reports:
    • LnkCap: Speed 2.5GT/s, Width x16
    • LnkSta: Speed 2.5GT/s, Width x1 (downgraded) (DLActive “-” on endpoint view)
  • Kernel boot log includes:
    • pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0
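The numbers in that kernel message can be reproduced by hand, which shows that "0.000 Gb/s" is just the arithmetic result of a reported width of x0. The sketch below assumes the usual PCIe line encodings (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 GT/s and above) and rounds the per-lane rate down to whole Mb/s, which reproduces the figures in the logs above. The function name pcie_bandwidth_gbps is hypothetical.

```python
def pcie_bandwidth_gbps(speed_gts, width):
    """Usable PCIe bandwidth in Gb/s for a given per-lane rate and lane count.

    Assumes 8b/10b encoding up to 5 GT/s and 128b/130b above that,
    with the per-lane rate truncated to whole Mb/s.
    """
    rate_mtps = round(speed_gts * 1000)  # GT/s -> MT/s as an integer
    if speed_gts <= 5.0:
        per_lane_mbps = rate_mtps * 8 // 10
    else:
        per_lane_mbps = rate_mtps * 128 // 130
    return per_lane_mbps * width / 1000


if __name__ == "__main__":
    print(pcie_bandwidth_gbps(2.5, 16))  # GPU LnkCap from the log: 32.0
    print(pcie_bandwidth_gbps(32, 4))    # ConnectX-7 links from the log: 126.028
    print(pcie_bandwidth_gbps(2.5, 0))   # a reported width of x0 gives 0.0
```

So the "capable of 32.000 Gb/s" part is 2.5 GT/s x 16 lanes x 8/10, and the 126.028 Gb/s reported for the mlx5_core devices is 32 GT/s x 4 lanes with 128b/130b encoding.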

DOE (Data Object Exchange) failures on GPU endpoint:

  • Kernel log:
    • pci 000f:01:00.0: DOE: [2c8] ABORT timed out
    • pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
    • pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
  • lspci -vv for 000f:01:00.0 shows:
    • DOESta: Busy+ IntSta+ Error+ ObjectReady-
    • DOE capability at offset [2c8]
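To compare DOE messages between two Sparks (for example, to see whether one unit logs "ABORT timed out" and the other does not), the kernel log can be grouped by device and capability offset. This is a rough sketch that assumes the dmesg line format quoted in this thread; the regular expression and the function name collect_doe_messages are illustrative only.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   pci 000f:01:00.0: DOE: [2c8] ABORT timed out
DOE_LINE = re.compile(
    r"pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"DOE: \[(?P<offset>[0-9a-f]+)\] (?P<message>.+)"
)


def collect_doe_messages(dmesg_text):
    """Group DOE kernel messages by (PCI address, capability offset)."""
    found = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = DOE_LINE.search(line)
        if match:
            key = (match["addr"], match["offset"])
            found[key].append(match["message"].strip())
    return dict(found)


if __name__ == "__main__":
    # Sample lines copied from the kernel log in this thread.
    sample = (
        "[    1.110847] pci 000f:01:00.0: DOE: [2c8] ABORT timed out\n"
        "[    1.110854] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5\n"
        "[    1.110866] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5\n"
    )
    for key, messages in collect_doe_messages(sample).items():
        print(key, messages)
```

Feeding it the full dmesg text from each machine gives one list per device, which is easier to diff than raw logs.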

Attempted mitigation impacted network reachability:

  • Booting with pcie_aspm=off resulted in loss of SSH access (machine became unreachable remotely). Reverted.

Separate CPU scheduler/energy-model noise (likely unrelated but present):

  • Kernel emits repeated messages:
    • processor cpuX: EM: CPUs of 0-4,10-14 must have the same capacity
    • energy_model: Accessing cpuY policy failed
  • energy_model=off added to GRUB, but kernel reports:
    • Unknown kernel command line parameters "... energy_model=off" and continues emitting EM messages
  • sysctl kernel.sched_energy_aware=0 exists but returns “Operation not supported”, causing systemd-sysctl to fail.

I suspect the issue is related to a bug in the interaction between the BIOS and the kernel.
I am also trying to connect this with some failures I am facing with Ray + vLLM when loading LLMs with 70B+ parameters in cluster mode.

How can I open a bug or support ticket with NVIDIA for this?

What you are seeing is normal, I see the same on my system. The kernel sees it like a PCI connected device, but it’s not.

As for your failures running vLLM in cluster mode: you are welcome to try our community Docker build, which works just fine for me and others here. Be sure to follow the NVIDIA playbook on how to connect two Sparks first (it’s also linked from the README in the repo):

Thank you, I will try the community Docker build.

I am able to load LLMs in cluster mode. The problems start when I get close to the limits, at which point I see GPU errors.

Well, if you are still having issues after trying the community build, post the launch command and the error here.
