[H200] XID 94 "Contained: SM (0x1). RST: No, D-RST: No" with zero ECC counter increment, preceded by XID 137 (TLC RX PRIV Error) cascade

IntStr · May 12, 2026, 4:05am

Summary

I’m seeking help interpreting a series of XID errors observed in a multi-node vLLM inference deployment running across two 8x H200 SXM nodes (16 GPUs total) in Kubernetes. NCCL is used for both intra-node communication (NVLink) and inter-node communication (IB/RoCE).

On one of the two nodes (where we have kernel-log access), XID 137 (NVLink TLC RX PRIV Error) was reported across 6 of 8 GPUs, immediately followed by XID 94 with the payload Contained: SM (0x1). RST: No, D-RST: No — all within the same second. The application hung and recovered after a pod restart.

The peer node, to which we currently do not have direct kernel-log access, also shows multiple XID 94 occurrences in its DCGM metrics history (DCGM_FI_DEV_XID_ERRORS{err_code="94"}). This suggests the issue is not isolated to a single node’s hardware but is a recurring pattern affecting both nodes of this deployment.

The puzzling part: after the incident, none of the ECC counters or row remap counters were incremented on any GPU, despite XID 94 being documented as “Contained ECC Error.” All GPU UUIDs remained unchanged (no RMA), and no InfoROM corruption is reported. I would like to understand whether this combination is expected, and what Contained: SM specifically indicates.

Environment

GPU: 8x NVIDIA H200 SXM per node (2 nodes total, 16 GPUs aggregate)
Driver: 575.51.03
OS: Red Hat Enterprise Linux 8.6 (Ootpa)
Orchestration: Kubernetes with NVIDIA GPU Operator
Monitoring: DCGM 4.5.2 / dcgm-exporter 4.8.1
Inter-node fabric: InfiniBand (used by NCCL for cross-node collective communication)
Workload: Multi-node vLLM inference across 2 nodes × 8 GPUs. The exact parallelism configuration (tensor parallel size, pipeline parallel size, possibly with Ray coordination) at the time of the incident is being verified with our operations team.
Container image: built on RHEL UBI 9.6, hosting multiple vLLM virtual environments. Versions present in the container include vLLM 0.8.5, 0.10.0, 0.11.0, 0.14.0, 0.16.0, 0.17.1, 0.18.0, and 0.19.0. Since v0.19.0 was not yet released on January 9, the active version at the time of the incident must have been an earlier one. We are working with our operations team to confirm the exact version (and NCCL version) active during the incident; will update this thread once confirmed.

Kernel log: `zgrep "NVRM: Xid" /var/log/messages*`

The relevant entries from /var/log/messages show a cascade on January 9 that falls entirely within one second (08:54:08), preceded by an XID 31 on January 8 on a different GPU (PCI 0000:3b:00) that we believe is unrelated:


Jan 8 18:38:50 <host> kernel: NVRM: Xid (PCI:0000:3b:00): 31, pid=3944045, name=python3.12, Ch 0000000c, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC4 GPCCLIENT_T1_5 faulted @ ... of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 137, TLC RX interrupt hit on link 6 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 137, TLC RX interrupt hit on link 8 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:cb:00): 137, TLC RX interrupt hit on link 0 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:bb:00): 137, TLC RX interrupt hit on link 2 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 137, TLC RX interrupt hit on link 4 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 137, TLC RX interrupt hit on link 4 on GPU0: PRIV Error

... (many additional XID 137 entries across PCI bus IDs 4c, 5d, 9b, bb, cb, db on various NVLink link numbers 0-17)

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:cb:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:bb:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 94, pid=3944065, name=python3.12, Ch 0000000a

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 94, pid=3944047, name=python3.12, Ch 0000000a

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 94, pid=3944048, name=python3.12, Ch 0000000a

... (per-channel reports following the containment events, channels 0x0a through 0x0e)

Affected: 6 of 8 GPUs on this node (PCI 0000:4c:00, 5d:00, 9b:00, bb:00, cb:00, db:00). The application was python3.12 running vLLM, distributed across this node and one peer node (16 GPUs total).
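The per-GPU summary above can be reproduced from the zgrep output with a short script. The following is a minimal Python sketch, not the tooling used during the incident. The function names (parse_xid_lines, summarize, gpus_with_cascade) are hypothetical, and it assumes the input lines are in chronological order and follow the "NVRM: Xid (PCI:...): <code>, ..." shape shown in the excerpt.

```python
import re
from collections import defaultdict

# Matches the "NVRM: Xid (PCI:0000:9b:00): 137, ..." shape seen in the excerpt above.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid_lines(lines):
    """Yield (line_index, pci, xid, detail) for each line carrying an Xid report."""
    for idx, line in enumerate(lines):
        match = XID_RE.search(line)
        if match:
            yield idx, match.group("pci"), int(match.group("xid")), match.group("detail").strip()


def summarize(lines):
    """Return (first_seen, counts), both keyed by PCI address and then by Xid code.

    first_seen[pci][xid] is the index of the first line reporting that code;
    counts[pci][xid] is how many lines reported it.
    """
    first_seen = defaultdict(dict)
    counts = defaultdict(lambda: defaultdict(int))
    for idx, pci, xid, _detail in parse_xid_lines(lines):
        first_seen[pci].setdefault(xid, idx)
        counts[pci][xid] += 1
    return first_seen, counts


def gpus_with_cascade(first_seen, first=137, then=94):
    """Return PCI addresses where `first` was logged before `then` (assumes ordered input)."""
    return sorted(
        pci
        for pci, seen in first_seen.items()
        if first in seen and then in seen and seen[first] < seen[then]
    )
```

Passing the lines produced by the zgrep command to summarize() and then calling gpus_with_cascade() lists the GPUs on which XID 137 was logged before XID 94. zgrep over several rotated files may not emit lines in chronological order, so the input may need sorting by timestamp first.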

Peer node correlation: We do not have direct kernel-log access to the peer node, but the DCGM metric DCGM_FI_DEV_XID_ERRORS{err_code="94"} shows multiple XID 94 occurrences on the peer node as well, observed at various points during the same operational period. The peer node uses the same driver version, container image, and workload as this node.
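Where only the exporter output is available, as on the peer node, the set of GPUs that report a given XID can be pulled from a scrape of the metrics endpoint. This is a minimal Python sketch under stated assumptions: it parses Prometheus text-format lines, relies on the err_code and UUID labels mentioned in this post, and uses hypothetical function names (xid_samples, gpus_reporting). It shows which series exist in one scrape, not how often or when the error fired.

```python
import re

# One sample line in Prometheus text format: name{labels} value [timestamp]
LINE_RE = re.compile(r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)")
# One label pair: key="value", allowing backslash escapes inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def xid_samples(exposition_text, metric="DCGM_FI_DEV_XID_ERRORS"):
    """Yield (labels, value) for every sample of `metric` in a scrape."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield labels, float(match.group("value"))


def gpus_reporting(exposition_text, err_code, gpu_label="UUID"):
    """Return the sorted GPU identifiers that have a series with the given err_code.

    Assumption: the exporter attaches `err_code` and `UUID` labels, as in the
    DCGM_FI_DEV_XID_ERRORS{err_code="94"} selector quoted in this post.
    """
    found = set()
    for labels, _value in xid_samples(exposition_text):
        if labels.get("err_code") == str(err_code):
            found.add(labels.get(gpu_label, "unknown"))
    return sorted(found)
```

For timing correlation against the 08:54:08 cascade, the historical series would have to be queried from the metrics store rather than from a single scrape.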

GPU state after the incident (current)

nvidia-smi -q -d ECC,ROW_REMAPPER on all 8 GPUs shows zero for every counter:

DRAM Correctable / Uncorrectable (Volatile and Aggregate): all 0
SRAM Correctable / Uncorrectable (Volatile and Aggregate): all 0
Row Remapper Correctable / Uncorrectable / Pending / Failure: all 0

No “infoROM is corrupted” warning. GPU UUIDs were confirmed unchanged from before the incident via historical DCGM metrics labeled with UUID (no GPU has been physically replaced).
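To repeat the zero-counter check after each future incident, the ECC totals can be compared mechanically instead of read by eye. The sketch below is a minimal Python example and rests on assumptions: the query field names in ECC_FIELDS should be confirmed with `nvidia-smi --help-query-gpu` on the installed driver, the function names are hypothetical, and row remapper counters are not covered. It parses CSV text, so it can be run against saved output as well as a live query.

```python
import csv
import io

# Assumed --query-gpu field names; verify with `nvidia-smi --help-query-gpu`.
ECC_FIELDS = [
    "pci.bus_id",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.corrected.aggregate.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def ecc_query_command(fields=ECC_FIELDS):
    """Build the nvidia-smi argument list that produces the CSV parsed below."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader,nounits"]


def nonzero_counters(csv_text, fields=ECC_FIELDS):
    """Return {bus_id: {field: raw_value}} for every counter that is not exactly "0".

    Non-numeric cells (for example "[N/A]") are reported as well, so a counter
    that could not be read is never mistaken for a clean zero.
    """
    findings = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue
        bus_id = row[0].strip()
        for name, raw in zip(fields[1:], row[1:]):
            raw = raw.strip()
            if raw != "0":
                findings.setdefault(bus_id, {})[name] = raw
    return findings
```

An empty result from nonzero_counters() corresponds to the "all 0" state listed above. The output of ecc_query_command() can be passed to subprocess.run(..., capture_output=True, text=True) to obtain the CSV text on a GPU node.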

Questions

1. Meaning of Contained: SM (0x1). RST: No, D-RST: No

The public XID Errors documentation classifies XID 94 as “Contained ECC Error,” but the message payload here explicitly references “SM” rather than memory, and both reset flags indicate no GPU or device reset is required. Is this a containment of an SM-level error rather than a memory ECC event? Is there any official documentation of the message payload fields (the Contained: <subsystem> subtype, the numeric mask 0x1, and the RST/D-RST flags) for this XID?
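Until the payload format is confirmed, the fields can at least be split out and tracked across incidents. This is a minimal Python sketch that only separates the text of the message; it does not interpret it. The key names (unit, mask, rst, d_rst) are hypothetical labels chosen here, and the "Yes" value is an assumption, since only "No" appears in these logs.

```python
import re

# Shape taken from the log line: "Contained: SM (0x1). RST: No, D-RST: No"
PAYLOAD_RE = re.compile(
    r"Contained:\s*(?P<unit>[A-Za-z0-9_ ]+?)\s*\((?P<mask>0x[0-9a-fA-F]+)\)\.\s*"
    r"RST:\s*(?P<rst>Yes|No),\s*D-RST:\s*(?P<d_rst>Yes|No)"
)


def parse_xid94_payload(text):
    """Split an XID 94 containment message into its visible fields.

    Returns None for lines without this payload, such as the per-channel
    "94, pid=..., name=..., Ch ..." reports that follow the containment events.
    """
    match = PAYLOAD_RE.search(text)
    if match is None:
        return None
    return {
        "unit": match.group("unit"),            # text after "Contained:", e.g. "SM"
        "mask": int(match.group("mask"), 16),   # numeric value in parentheses
        "rst": match.group("rst") == "Yes",     # RST flag as printed
        "d_rst": match.group("d_rst") == "Yes", # D-RST flag as printed
    }
```

Applied to the lines above, every containment event yields unit "SM", mask 1, and both flags false, which makes any deviation in a later incident easy to spot.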

2. Expected ECC counter behavior

Is it expected for XID 94 to be reported without any ECC counter or row remap increment? Our prior understanding was that contained ECC errors leave a persistent trace in the InfoROM counters (DRAM or SRAM, Aggregate). The complete absence of such a trace, combined with RST: No, D-RST: No, suggests this may not be a persistent memory fault. Is there a class of containment that does not register in ECC counters?

3. XID 137 → XID 94 cascade across two nodes

Both errors occurred within the same second across overlapping sets of GPUs on the node where we have kernel-log access, suggesting a causal or cascading relationship through NVLink communication on that node. The fact that the peer node also shows recurring XID 94 events (visible in DCGM metrics) strongly suggests this is not a single-node hardware issue, but rather:

A software-side issue (NCCL, CUDA driver, or vLLM-level) common to both nodes, or
A driver-level issue in 575.51.03 affecting H200 systems, or
A cross-node communication pattern (over IB/RoCE) that triggers SM containment on both ends

Specifically, we would like to understand:

Could an issue originating on the inter-node fabric (IB/RoCE) propagate as XID 137 (intra-node NVLink TLC PRIV Error)? Or is XID 137 strictly an intra-node NVLink event?
What is the recommended diagnostic action for XID 137 (TLC RX PRIV Error)? Is it typically caused by:
  A software-side error (e.g., NCCL or CUDA driver interaction) sending malformed NVLink transactions?
  A transient hardware-level event on the NVLink/NVSwitch fabric?
  Something propagated from the inter-node fabric (IB/RoCE) via NCCL?
Are there known driver 575.x or NCCL issues that manifest as XID 137 followed by XID 94 “Contained: SM” in multi-node H200 deployments?

Additional context

We have not been able to find a clear public reference for the Contained: SM payload format. If this documentation exists internally, a pointer would be greatly appreciated.

Thank you for any guidance.
