ASUS Ascent GX10 (GB10) hard shutdown under heavy vLLM load | of_root node is NULL and EM: CPUs must have same capacity dmesg errors

Looking at your journalctl excerpt, three signals stand out:

1. CDI Device Injection Failure (20:05:33)

dockerd: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all

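Your container launched without proper GPU device mapping. Docker could not resolve the CDI (Container Device Interface) device name nvidia.com/gpu=all, which usually means no CDI spec declaring that device was visible to dockerd. This happens at container startup, roughly 7 minutes before the first allocation failure (20:05:33 vs. 20:12:10).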
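To check whether a CDI spec is present at all, here is a minimal sketch. It assumes the default CDI spec directories (/etc/cdi and /var/run/cdi) and uses a simple line-based regex rather than a real YAML/JSON parser, so treat it as a heuristic. The function name find_cdi_kinds is made up for this example.

```python
import re
from pathlib import Path

# Assumption: dockerd uses the default CDI spec directories.
CDI_DIRS = ("/etc/cdi", "/var/run/cdi")

# Heuristic: match a top-level `kind: vendor/class` (YAML) or `"kind": "vendor/class"` (JSON) line.
KIND_RE = re.compile(r'^\s*"?kind"?\s*:\s*"?([^",\s]+)"?', re.MULTILINE)


def find_cdi_kinds(dirs=CDI_DIRS):
    """Return {spec_path: kind} for every CDI spec file found in dirs."""
    found = {}
    for d in dirs:
        root = Path(d)
        if not root.is_dir():
            continue  # a missing directory just means no specs there
        for spec in sorted(root.iterdir()):
            if spec.suffix not in (".yaml", ".yml", ".json"):
                continue
            match = KIND_RE.search(spec.read_text(errors="replace"))
            if match:
                found[str(spec)] = match.group(1)
    return found


if __name__ == "__main__":
    kinds = find_cdi_kinds()
    if "nvidia.com/gpu" in kinds.values():
        print("CDI spec for nvidia.com/gpu found:", kinds)
    else:
        print("No nvidia.com/gpu CDI spec found in", CDI_DIRS)
```

If nothing is found, the NVIDIA Container Toolkit's nvidia-ctk cdi generate command is the usual way to produce a spec. Check the toolkit documentation for the exact invocation on your image.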

2. Memory Allocation Failure (20:12:10–20:12:11)

kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

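Two consecutive failures, about a second apart. The NVIDIA kernel driver (NVRM) could not allocate memory for a memory descriptor (_memdescAllocInternal). On GB10, the CPU and GPU share one LPDDR5X pool; there is no separate VRAM. A driver-level allocation failure here therefore points to that shared pool being exhausted.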
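Because the pool is shared, system-level memory headroom is a reasonable thing to watch while vLLM is loading. Below is a minimal sketch that parses /proc/meminfo and flags low headroom. The 10% threshold and the function names are assumptions for illustration, and MemAvailable is only the kernel's estimate, so it may not track driver-side allocations exactly.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


def low_headroom(info, min_fraction=0.10):
    """True when available memory drops below min_fraction (assumed threshold)."""
    return available_fraction(info) < min_fraction


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        print(f"available: {available_fraction(info):.1%}, low: {low_headroom(info)}")
```

Running this in a loop alongside the vLLM container would show whether available memory is trending toward zero before the NVRM errors appear.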

3. Cable Removal (20:02:52)

cx7-pcie-hotplug MTKP0001:00: Cable removal

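This event does not appear related to the shutdown sequence. It was logged before the container started and more than nine minutes before the allocation failures.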

The likely failure chain: CDI device injection fails at container startup → about 7 minutes later, under vLLM load, the driver can no longer allocate from the shared LPDDR5X pool → two allocation failures within about a second → shutdown.
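For reference, the gaps between those events can be worked out directly from the timestamps quoted above. This small sketch assumes all four entries fall on the same day and that the two NVRM failures map to 20:12:10 and 20:12:11 respectively. The event labels are just descriptive names.

```python
from datetime import datetime

# Timestamps taken from the journalctl excerpt discussed above.
EVENTS = [
    ("cable removal", "20:02:52"),
    ("CDI injection failure", "20:05:33"),
    ("first NV_ERR_NO_MEMORY", "20:12:10"),
    ("second NV_ERR_NO_MEMORY", "20:12:11"),  # assumed split of the 20:12:10-20:12:11 range
]


def gaps(events):
    """Return (earlier, later, timedelta) for each consecutive pair, in time order."""
    parsed = sorted((datetime.strptime(ts, "%H:%M:%S"), name) for name, ts in events)
    return [(a[1], b[1], b[0] - a[0]) for a, b in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for earlier, later, delta in gaps(EVENTS):
        print(f"{earlier} -> {later}: {int(delta.total_seconds())} s")
```

The CDI failure and the first allocation failure are 397 seconds apart (6 min 37 s), which is where the "about 7 minutes" figure comes from.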

This pattern matches a broader class of GB10 failures that occur when unified memory is exhausted. See the thread: System crashes when memory is full