[H200] XID 94 "Contained: SM (0x1). RST: No, D-RST: No" with zero ECC counter increment, preceded by XID 137 (TLC RX PRIV Error) cascade

Summary

I’m seeking help interpreting a series of XID errors observed in a multi-node vLLM inference deployment running across two 8x H200 SXM nodes (16 GPUs total) in Kubernetes. NCCL is used for both intra-node communication (NVLink) and inter-node communication (IB/RoCE).

On one of the two nodes (where we have kernel-log access), XID 137 (NVLink TLC RX PRIV Error) was reported across 6 of 8 GPUs, immediately followed by XID 94 with the payload Contained: SM (0x1). RST: No, D-RST: No — all within the same second. The application hung and recovered after a pod restart.

Importantly, the peer node — to which we currently do not have direct kernel-log access — also shows multiple XID 94 occurrences in its DCGM metrics history (DCGM_FI_DEV_XID_ERRORS{err_code="94"}). This strongly suggests the issue is not isolated to a single node’s hardware, but is a recurring pattern affecting both nodes of this deployment.

The puzzling part: after the incident, none of the ECC counters or row remap counters were incremented on any GPU, despite XID 94 being documented as “Contained ECC Error.” All GPU UUIDs remained unchanged (no RMA), and no InfoROM corruption is reported. I would like to understand whether this combination is expected, and what Contained: SM specifically indicates.

Environment

  • GPU: 8x NVIDIA H200 SXM per node (2 nodes total, 16 GPUs aggregate)

  • Driver: 575.51.03

  • OS: Red Hat Enterprise Linux 8.6 (Ootpa)

  • Orchestration: Kubernetes with NVIDIA GPU Operator

  • Monitoring: DCGM 4.5.2 / dcgm-exporter 4.8.1

  • Inter-node fabric: InfiniBand (used by NCCL for cross-node collective communication)

  • Workload: Multi-node vLLM inference across 2 nodes × 8 GPUs. The exact parallelism configuration (tensor parallel size, pipeline parallel size, possibly with Ray coordination) at the time of the incident is being verified with our operations team.

  • Container image: built on RHEL UBI 9.6, hosting multiple vLLM virtual environments. Versions present in the container include vLLM 0.8.5, 0.10.0, 0.11.0, 0.14.0, 0.16.0, 0.17.1, 0.18.0, and 0.19.0. Since v0.19.0 was not yet released on January 9, the active version at the time of the incident must have been an earlier one. We are working with our operations team to confirm the exact vLLM and NCCL versions active during the incident, and we will update this thread once they are confirmed.

Kernel log: zgrep "NVRM: Xid" /var/log/messages*

The relevant entries from /var/log/messages are below: an XID 31 on January 8 that we believe is unrelated, followed by the XID 137 / XID 94 cascade, all of which was logged on January 9 at 08:54:08:


Jan 8 18:38:50 <host> kernel: NVRM: Xid (PCI:0000:3b:00): 31, pid=3944045, name=python3.12, Ch 0000000c, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC4 GPCCLIENT_T1_5 faulted @ ... of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 137, TLC RX interrupt hit on link 6 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 137, TLC RX interrupt hit on link 8 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:cb:00): 137, TLC RX interrupt hit on link 0 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:bb:00): 137, TLC RX interrupt hit on link 2 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 137, TLC RX interrupt hit on link 4 on GPU0: PRIV Error

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 137, TLC RX interrupt hit on link 4 on GPU0: PRIV Error

... (many additional XID 137 entries across PCI bus IDs 4c, 5d, 9b, bb, cb, db on various NVLink link numbers 0-17)

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:cb:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:bb:00): 94, Contained: SM (0x1). RST: No, D-RST: No

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 94, pid=3944065, name=python3.12, Ch 0000000a

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:4c:00): 94, pid=3944047, name=python3.12, Ch 0000000a

Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 94, pid=3944048, name=python3.12, Ch 0000000a

... (per-channel reports following the containment events, channels 0x0a through 0x0e)

Affected: 6 of 8 GPUs on this node (PCI 0000:4c:00, 5d:00, 9b:00, bb:00, cb:00, db:00). The application was python3.12 running vLLM, distributed across this node and one peer node (16 GPUs total).
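
In case it helps anyone repeat the tally above, here is a minimal Python sketch that groups NVRM Xid lines by timestamp and XID code and lists the GPUs that reported both XID 137 and XID 94 in the same second. The regular expression assumes the syslog layout shown in the excerpt above; the names XID_RE, summarize_xids, and SAMPLE are our own for illustration, not part of any NVIDIA tool.

```python
import re
from collections import defaultdict

# Assumed layout (taken from the excerpt above):
#   Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 137, ...
XID_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\):\s*(?P<xid>\d+),"
)

def summarize_xids(lines):
    """Return {(timestamp, xid): set of PCI bus IDs} for every NVRM Xid line."""
    seen = defaultdict(set)
    for line in lines:
        m = XID_RE.match(line.strip())
        if m:
            # Collapse runs of spaces so "Jan  9" and "Jan 9" compare equal.
            ts = " ".join(m["ts"].split())
            seen[(ts, int(m["xid"]))].add(m["pci"])
    return dict(seen)

# Four lines copied from the kernel log excerpt above.
SAMPLE = [
    "Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:9b:00): 137, TLC RX interrupt hit on link 6 on GPU0: PRIV Error",
    "Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 137, TLC RX interrupt hit on link 4 on GPU0: PRIV Error",
    "Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:5d:00): 94, Contained: SM (0x1). RST: No, D-RST: No",
    "Jan 9 08:54:08 <host> kernel: NVRM: Xid (PCI:0000:db:00): 94, Contained: SM (0x1). RST: No, D-RST: No",
]

summary = summarize_xids(SAMPLE)
ts = "Jan 9 08:54:08"
# GPUs that logged both XID 137 and XID 94 in that second.
both = summary.get((ts, 137), set()) & summary.get((ts, 94), set())
print(sorted(both))
```

Fed the full zgrep output instead of SAMPLE, the intersection should come out as the six bus IDs listed above. Syslog timestamps have one-second resolution, so this cannot establish ordering within the second.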

Peer node correlation: We do not have direct kernel-log access to the peer node, but the DCGM metric DCGM_FI_DEV_XID_ERRORS{err_code="94"} shows multiple XID 94 occurrences on the peer node as well, observed at various points during the same operational period. The peer node uses the same driver version, container image, and workload as this node.
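
To line up the peer node's XID 94 samples with the 08:54:08 cascade, something like the following sketch could summarize a Prometheus query_range response for DCGM_FI_DEV_XID_ERRORS{err_code="94"}. It only parses an already-fetched JSON response. The label names Hostname and UUID are what we assume dcgm-exporter attaches in our setup, and the function name is hypothetical.

```python
from datetime import datetime, timezone

def _iso(epoch_seconds):
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

def xid_sample_windows(response, host_label="Hostname", gpu_label="UUID"):
    """Summarize each series of a Prometheus matrix (query_range) response.

    host_label and gpu_label are assumptions about the exporter's labels;
    adjust them to match the labels present in your metrics.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    windows = []
    for series in response["data"]["result"]:
        values = series.get("values", [])
        if not values:
            continue
        # Each entry in "values" is a [unix_timestamp, "string_value"] pair.
        times = [float(ts) for ts, _ in values]
        windows.append({
            "host": series["metric"].get(host_label, "?"),
            "gpu": series["metric"].get(gpu_label, "?"),
            "first_seen": _iso(min(times)),
            "last_seen": _iso(max(times)),
            "samples": len(times),
        })
    return windows
```

This only reports when each series was present in the scrape data. It does not recover the exact time of the underlying XID, which would still require the peer node's kernel log.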

GPU state after the incident (current)

nvidia-smi -q -d ECC,ROW_REMAPPER on all 8 GPUs shows zero for every counter:

  • DRAM Correctable / Uncorrectable (Volatile and Aggregate): all 0

  • SRAM Correctable / Uncorrectable (Volatile and Aggregate): all 0

  • Row Remapper Correctable / Uncorrectable / Pending / Failure: all 0

No “infoROM is corrupted” warning. GPU UUIDs were confirmed unchanged from before the incident via historical DCGM metrics labeled with UUID (no GPU has been physically replaced).
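
For anyone who wants to repeat the all-zero check on every GPU without reading the output by eye, below is a minimal sketch that flags any counter that is not zero in CSV output from nvidia-smi. The command in the comment is an assumption on our part: please verify the field names with nvidia-smi --help-query-remapped-rows on your driver. The second sample row is hypothetical, added only to show what a flagged GPU looks like; it is not from our systems.

```python
import csv
import io

# Values treated as "nothing recorded". Anything else, including N/A, is
# flagged so that it gets a human look.
CLEAN = {"0", "no"}

def nonzero_counters(csv_text, id_field="gpu_bus_id"):
    """Return {gpu_id: {field: value}} for every field that is not 0 / No."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    flagged = {}
    for row in reader:
        gpu = (row[id_field] or "").strip()
        bad = {
            key: (value or "").strip()
            for key, value in row.items()
            if key not in (None, id_field)
            and (value or "").strip().lower() not in CLEAN
        }
        if bad:
            flagged[gpu] = bad
    return flagged

# Assumed command (verify the field names on your driver before relying on it):
#   nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.correctable,\
#   remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure \
#   --format=csv
# HYPOTHETICAL sample output: the second row is invented for illustration.
SAMPLE_CSV = """
gpu_bus_id, remapped_rows.correctable, remapped_rows.uncorrectable, remapped_rows.pending, remapped_rows.failure
00000000:4C:00.0, 0, 0, No, No
00000000:5D:00.0, 0, 1, Yes, No
"""

print(nonzero_counters(SAMPLE_CSV))
```

On our node the equivalent check comes back empty for all 8 GPUs, which is the observation behind question 2 below.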

Questions

1. Meaning of Contained: SM (0x1). RST: No, D-RST: No

The public XID Errors documentation classifies XID 94 as “Contained ECC Error,” but the message payload here explicitly references “SM” rather than memory, and both reset flags read No, which we take to mean that neither a GPU reset nor a device reset is required. Is this a containment of an SM-level error rather than a memory ECC event? Is there any official documentation of the message payload fields (the Contained: <subsystem> subtype, the numeric mask 0x1, and the RST/D-RST flags) for this XID?

2. Expected ECC counter behavior

Is it expected for XID 94 to be reported without any ECC counter or row remap increment? Our prior understanding was that contained ECC errors leave a persistent trace in the InfoROM counters (DRAM or SRAM, Aggregate). The complete absence of such a trace, combined with RST: No, D-RST: No, suggests this may not be a persistent memory fault. Is there a class of containment that does not register in ECC counters?

3. XID 137 → XID 94 cascade across two nodes

Both errors occurred within the same second on the same six GPUs of the node where we have kernel-log access, suggesting a causal or cascading relationship through NVLink communication on that node. The fact that the peer node also shows recurring XID 94 events (visible in DCGM metrics) strongly suggests this is not a single-node hardware issue, but rather one of the following:

  • A software-side issue (NCCL, CUDA driver, or vLLM-level) common to both nodes, or

  • A driver-level issue in 575.51.03 affecting H200 systems, or

  • A cross-node communication pattern (over IB/RoCE) that triggers SM containment on both ends

Specifically, we would like to understand:

  • Could an issue originating on the inter-node fabric (IB/RoCE) propagate as XID 137 (intra-node NVLink TLC PRIV Error)? Or is XID 137 strictly an intra-node NVLink event?

  • What is the recommended diagnostic action for XID 137 (TLC RX PRIV Error)? Is it typically caused by:

      • A software-side error (e.g., NCCL or CUDA driver interaction) sending malformed NVLink transactions?

      • A transient hardware-level event on the NVLink/NVSwitch fabric?

      • Something propagated from the inter-node fabric (IB/RoCE) via NCCL?

  • Are there known driver 575.x or NCCL issues that manifest as XID 137 followed by XID 94 “Contained: SM” in multi-node H200 deployments?

Additional context

  • We have not been able to find a clear public reference for the Contained: SM payload format. If this documentation exists internally, a pointer would be greatly appreciated.

Thank you for any guidance.