I am having GPU issues when trying to run two DGX Spark units in cluster mode with Ray + vLLM.
Problem statement
GPU PCIe link does not train correctly.
Root port 000f:00:00.0 reports LnkSta: Speed unknown, Width x0.
GPU 000f:01:00.0 is stuck at PCIe Gen1 x1.
Kernel logs report 0.000 Gb/s available PCIe bandwidth and repeated DOE mailbox timeouts.
Issue persists across reboots, kernel parameters, and OS configuration changes.
I suspect hardware failure.
I have installed the latest updates for the OS, BIOS, etc., but the issue persists.
Any thoughts?
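The link symptoms listed in the problem statement can be checked mechanically. As a minimal sketch (the helper names `parse_link_lines` and `link_degraded` are hypothetical, not part of any existing tool), the following Python compares the `LnkCap` and `LnkSta` lines from `lspci -vv` text and reports whether the negotiated link is below what the port advertises. It assumes the line format shown in the dumps further down.

```python
import re

# Matches the LnkCap / LnkSta lines of `lspci -vv`, but not LnkCap2 / LnkSta2.
LINK_RE = re.compile(
    r"(LnkCap|LnkSta):\s.*?Speed\s+(unknown|[\d.]+GT/s).*?Width\s+x(\d+)"
)


def parse_link_lines(lspci_text):
    """Return {'LnkCap': (speed_gts, width), 'LnkSta': (speed_gts, width)}.

    speed_gts is None when lspci prints "Speed unknown".
    """
    result = {}
    for line in lspci_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            field, speed, width = match.groups()
            speed_gts = None if speed == "unknown" else float(speed[:-4])
            result[field] = (speed_gts, int(width))
    return result


def link_degraded(links):
    """True if the negotiated speed or width is below the advertised capability."""
    cap_speed, cap_width = links["LnkCap"]
    sta_speed, sta_width = links["LnkSta"]
    if sta_speed is None:
        return True
    return sta_speed < cap_speed or sta_width < cap_width
```

Feeding it the output of `sudo lspci -s 000f:00:00.0 -vv` or `sudo lspci -s 000f:01:00.0 -vv` should flag both ends of this link as degraded.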
uname -a
Linux koula 6.14.0-1015-nvidia #15-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 25 18:02:16 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
NVIDIA_DGX_Spark A.7 P4242 A04
Vendor: American Megatrends International, LLC.
Version: 5.36_0ACUM018
Release Date: 08/06/2025
American Megatrends International, LLC. 5.36_0ACUM018 08/06/2025
From dmesg:
[ 0.097574] pci 0009:00:00.0: Max Payload Size set to 512/ 512 (was 128), Max Read Rq 512
[ 0.097594] pci 0009:01:00.0: Max Payload Size set to 256/ 256 (was 128), Max Read Rq 256
[ 0.102037] platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
[ 0.102050] acpi NVDA8800:00: platform device creation failed: -16
[ 0.102126] platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
[ 0.102132] acpi NVDA8900:00: platform device creation failed: -16
[ 0.102927] ACPI: PCI Root Bridge [PCIF] (domain 000f [bus 00-01])
[ 0.102938] acpi PNP0A08:0b: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
[ 0.103011] acpi PNP0A08:0b: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.103105] acpi PNP0A08:0b: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability LTR]
[ 0.103855] acpi PNP0A08:0b: ECAM area [mem 0x29000000-0x291fffff] reserved by PNP0C02:01
[ 0.103873] acpi PNP0A08:0b: ECAM at [mem 0x29000000-0x291fffff] for [bus 00-01]
[ 0.104003] PCI host bridge to bus 000f:00
[ 0.104032] pci_bus 000f:00: root bus resource [mem 0x24000000-0x281fffff window]
[ 0.104037] pci_bus 000f:00: root bus resource [bus 00-01]
[ 0.104071] pci 000f:00:00.0: [10de:22d1] type 01 class 0x060400 PCIe Root Port
[ 0.104108] pci 000f:00:00.0: PCI bridge to [bus 01]
[ 0.104134] pci 000f:00:00.0: bridge window [mem 0x24000000-0x27ffffff 64bit pref]
[ 0.104291] pci 000f:00:00.0: PME# supported from D0 D3hot
[ 0.104824] pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
[ 0.104894] pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]
[ 0.104964] pci 000f:01:00.0: Enabling HDA controller
[ 1.110847] pci 000f:01:00.0: DOE: [2c8] ABORT timed out
[ 1.110854] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
[ 1.110866] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
[ 1.110905] pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
[ 1.111279] pci 000f:00:00.0: PCI bridge to [bus 01]
[ 1.111304] pci 000f:00:00.0: PCI bridge to [bus 01]
sudo dmidecode -t bios
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends International, LLC.
Version: 5.36_0ACUM018
Release Date: 08/06/2025
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 32 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
ACPI is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.36
Handle 0x0006, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 1
en|US|iso8859-1
Currently Installed Language: en|US|iso8859-1
sudo lspci -s 000f:00:00.0 -vv
000f:00:00.0 PCI bridge: NVIDIA Corporation Device 22d1 (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin ? routed to IRQ 343
IOMMU group: 5
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: [disabled] [32-bit]
Memory behind bridge: [disabled] [32-bit]
Prefetchable memory behind bridge: 24000000-27ffffff [size=64M] [32-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <32us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 128 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
LnkSta: Speed unknown, Width x0
TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS+ LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, ARIFwd+
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS+
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: Downstream Port, DRS-
DownstreamComp: Link Up - Present
Capabilities: [84] MSI: Enable+ Count=4/4 Maskable+ 64bit+
Address: 0000000006850040 Data: 0000
Masking: 0000000a Pending: 00000000
Capabilities: [100 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [12c v1] Data Link Feature <?>
Capabilities: [138 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [168 v1] Extended Capability ID 0x2a
Capabilities: [198 v1] Lane Margining at the Receiver <?>
Capabilities: [1f4 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap+
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 2
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [23c v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
Capabilities: [244 v1] FRS Queueing <?>
Capabilities: [298 v1] Hierarchy ID <?>
Capabilities: [2b8 v1] Extended Capability ID 0x30
Kernel driver in use: pcieport
sudo lspci -s 000f:01:00.0 -vv
000f:01:00.0 VGA compatible controller: NVIDIA Corporation Device 2e12 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 0000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 481
IOMMU group: 20
Region 0: Memory at 24000000 (64-bit, prefetchable) [size=64M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 256 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn+
LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [100 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [12c v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [14c v1] Data Link Feature <?>
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [188 v1] Extended Capability ID 0x2a
Capabilities: [1b8 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Lane Margining at the Receiver <?>
Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [290 v2] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=0us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Capabilities: [2c8 v1] Data Object Exchange
DOECap: IntSup+
Interrupt Message Number 008
DOECtl: IntEn-
DOESta: Busy+ IntSta+ Error+ ObjectReady-
Capabilities: [2e0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [2e8 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv-, Max PASID Width: 14
PASIDCtl: Enable+ Exec- Priv-
Capabilities: [2f0 v1] Device Serial Number 00-00-00-00-00-2d-b0-48
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nvidia_drm, nvidia
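The `DOESta: Busy+ IntSta+ Error+ ObjectReady-` line in the endpoint dump above corresponds to a raw status value of 0x7. As a small sketch, the decoder below maps a raw DOE Status register value to those flags. The bit positions are taken from my reading of the PCIe DOE capability layout (status register at offset 0x0C of the capability, which sits at 0x2c8 on this device), so treat them as an assumption to verify; `decode_doe_status` is a hypothetical helper name.

```python
# Assumed DOE Status register bit layout (offset 0x0C within the DOE capability).
DOE_STATUS_BITS = {
    "Busy": 0x00000001,         # mailbox is processing or cannot accept data
    "IntSta": 0x00000002,       # interrupt status
    "Error": 0x00000004,        # mailbox reported an error
    "ObjectReady": 0x80000000,  # a response object is ready to read
}


def decode_doe_status(value):
    """Return a dict of flag name -> bool for a raw 32-bit DOE status value."""
    return {name: bool(value & mask) for name, mask in DOE_STATUS_BITS.items()}


def format_doe_status(value):
    """Render the flags in lspci style, e.g. 'Busy+ IntSta+ Error+ ObjectReady-'."""
    flags = decode_doe_status(value)
    return " ".join(f"{name}{'+' if on else '-'}" for name, on in flags.items())
```

A mailbox that reports Busy and Error together and never clears them would be consistent with the `ABORT timed out` messages in dmesg.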
sudo dmesg | grep -i '000f:01:00.0\|pcie bandwidth\|DOE'
[ 0.071805] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.079257] acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.086694] acpi PNP0A08:02: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.089678] acpi PNP0A08:03: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.090378] acpi PNP0A08:04: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.093715] acpi PNP0A08:05: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.094397] acpi PNP0A08:06: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.103011] acpi PNP0A08:0b: _OSC: platform does not support [SHPCHotplug DPC]
[ 0.104824] pci 000f:01:00.0: [10de:2e12] type 00 class 0x030000 PCIe Endpoint
[ 0.104894] pci 000f:01:00.0: BAR 0 [mem 0x24000000-0x27ffffff 64bit pref]
[ 0.104964] pci 000f:01:00.0: Enabling HDA controller
[ 1.110847] pci 000f:01:00.0: DOE: [2c8] ABORT timed out
[ 1.110854] pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
[ 1.110866] pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5
[ 1.110905] pci 000f:01:00.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 000f:00:00.0 (capable of 32.000 Gb/s with 2.5 GT/s PCIe x16 link)
[ 1.111361] pci 000f:01:00.0: Max Payload Size set to 256/ 256 (was 128), Max Read Rq 256
[ 1.114872] pci 000f:01:00.0: vgaarb: setting as boot VGA device
[ 1.114876] pci 000f:01:00.0: vgaarb: bridge control possible
[ 1.114879] pci 000f:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 2.117415] mlx5_core 0000:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
[ 2.608958] mlx5_core 0000:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
[ 3.085515] mlx5_core 0002:01:00.0: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
[ 3.570574] mlx5_core 0002:01:00.1: 126.028 Gb/s available PCIe bandwidth (32.0 GT/s PCIe x4 link)
[ 6.079621] nvidia 000f:01:00.0: Adding to iommu group 20
[ 6.108638] nvidia 000f:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 11.160658] [drm] Initialized nvidia-drm 0.0.0 for 000f:01:00.0 on minor 1
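The bandwidth figures in the kernel lines above follow from link speed, lane count, and line encoding: 8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 GT/s and above. The sketch below reproduces that arithmetic. The per-lane integer division is an assumption, chosen because it matches the logged values (32.000 Gb/s for 2.5 GT/s x16 and 126.028 Gb/s for 32 GT/s x4); it is not copied from kernel source, and `available_gbps` is a hypothetical helper name.

```python
# Usable per-lane rate in Mb/s after line-encoding overhead.
# Integer division is assumed here because it reproduces the logged numbers.
ENCODED_MBPS_PER_LANE = {
    2.5: 2500 * 8 // 10,       # 8b/10b
    5.0: 5000 * 8 // 10,       # 8b/10b
    8.0: 8000 * 128 // 130,    # 128b/130b
    16.0: 16000 * 128 // 130,  # 128b/130b
    32.0: 32000 * 128 // 130,  # 128b/130b
}


def available_gbps(speed_gts, width):
    """Return usable bandwidth in Gb/s for a link of the given speed and lane count."""
    return ENCODED_MBPS_PER_LANE[speed_gts] * width / 1000


# What the GPU endpoint advertises (LnkCap: 2.5GT/s x16):
print(available_gbps(2.5, 16))   # 32.0
# What the ConnectX links negotiate (32 GT/s x4):
print(available_gbps(32.0, 4))   # 126.028
# What a healthy Gen5 x16 link would offer:
print(available_gbps(32.0, 16))  # 504.112
# What the GPU reports as its current link (Gen1 x1):
print(available_gbps(2.5, 1))    # 2.0
```

By this arithmetic the link that nvidia-smi reports (Gen1 x1) is roughly 1/250 of what the root port's advertised 32 GT/s x16 capability would give.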
nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Tue Dec 30 00:13:14 2025
Driver Version : 580.95.05
CUDA Version : 13.0
Attached GPUs : 1
GPU 0000000F:01:00.0
Product Name : NVIDIA GB10
Product Brand : NVIDIA RTX
Product Architecture : Blackwell
Display Mode : Requested functionality has been deprecated
Display Attached : Yes
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : ATS
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-380f91b2-3260-8323-ee1c-3d5243189b61
GPU PDI : 0x2043235a78513c47
Minor Number : 0
VBIOS Version : 9A.0B.0F.00.1D
MultiGPU Board : No
Board ID : 0xf0100
Board Part Number : N/A
GPU Part Number : 2E12-275-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number :
Slot Number : 0
Tray Index : 0
Host ID : 1
Peer Type : Direct Connected
Module Id : 1
GPU Fabric GUID : 0x0000000000000000
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : Enabled
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Recovery Action : None
GSP Firmware Version : 580.95.05
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x000F
Base Classcode : 0x3
Sub Classcode : 0x0
Device Id : 0x2E1210DE
Bus Id : 0000000F:01:00.0
Sub System Id : 0x000010DE
GPU Link Info
PCIe Generation
Max : 1
Current : 1
Device Current : 1
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 1x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : N/A
Rx Throughput : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Atomic Caps Inbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Clocks Event Reasons Counters
SW Power Capping : 1179038295 us
Sync Boost : 0 us
SW Thermal Slowdown : 0 us
HW Thermal Slowdown : 0 us
HW Power Braking : 0 us
Sparse Operation Mode : N/A
FB Memory Usage
Total : N/A
Reserved : N/A
Used : N/A
Free : N/A
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Channel Repair Pending : N/A
TPC Repair Pending : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 42 C
GPU T.Limit Temp : 53 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : N/A
GPU Power Readings
Average Power Draw : 4.74 W
Instantaneous Power Draw : 4.74 W
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 208 MHz
SM : 208 MHz
Memory : N/A
Video : 598 MHz
Applications Clocks
Graphics : 2418 MHz
Memory : N/A
Default Applications Clocks
Graphics : 2418 MHz
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 3003 MHz
SM : 3003 MHz
Memory : N/A
Video : 3003 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Summary : N/A
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Incorrect Configuration : N/A
Processes : None
Capabilities
EGM : disabled
nvidia-smi -q | egrep -i 'VBIOS Version|GSP Firmware Version|Driver Version|CUDA Version'
Driver Version : 580.95.05
CUDA Version : 13.0
VBIOS Version : 9A.0B.0F.00.1D
GSP Firmware Version : 580.95.05
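For repeated checks after each reboot or configuration change, the link fields can be pulled without the full `nvidia-smi -q` dump. The following is a sketch only: it assumes the `--query-gpu` field names `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current`, and `pcie.link.width.max` are available in this driver version, and the helper names are hypothetical.

```python
import subprocess

QUERY_FIELDS = [
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def parse_link_csv(line):
    """Parse one CSV row into a dict; non-numeric cells (e.g. '[N/A]') become None."""
    values = []
    for raw in line.split(","):
        raw = raw.strip()
        values.append(int(raw) if raw.isdigit() else None)
    return dict(zip(QUERY_FIELDS, values))


def query_gpu_link():
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_link_csv(line) for line in out.splitlines() if line.strip()]
```

On the system described here, a row such as `1, 1, 1, 16` would match the "GPU Link Info" section above: generation 1 at width 1, against a maximum width of 16.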
