Looking at your journalctl excerpt, three signals stand out:
1. CDI Device Injection Failure (20:05:33)
dockerd: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all
Docker could not resolve the CDI (Container Device Interface) device name nvidia.com/gpu=all, so your container did not get the GPU device mapping it requested. This happened at container startup, roughly 7 minutes before the crash.
2. Memory Allocation Failure (20:12:10–20:12:11)
kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
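Two consecutive failures, about a second apart. The NVIDIA kernel driver could not allocate memory for a memory descriptor (the _memdescAllocInternal call). On GB10, CPU and GPU share one LPDDR5X pool, with no separate boundary between them. A driver-level allocation failure therefore strongly suggests that the shared pool was exhausted.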
3. Cable Removal (20:02:52)
cx7-pcie-hotplug MTKP0001:00: Cable removal
This is the earliest of the three signals, logged about 9 minutes before the allocation failures, and it does not appear related to the shutdown sequence.
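Going back to signal 1: an "unresolvable CDI devices" error usually means no CDI spec file defines the requested device. Here is a minimal sketch for checking that, assuming Docker is using the default CDI spec directories /etc/cdi and /var/run/cdi (your daemon configuration may point elsewhere). The function name find_cdi_specs is mine, not part of any tool.

```python
from pathlib import Path

# Assumed default CDI spec directories; Docker's daemon config can override them.
DEFAULT_CDI_DIRS = ("/etc/cdi", "/var/run/cdi")
# CDI specs are stored as YAML or JSON files.
SPEC_SUFFIXES = {".yaml", ".json"}


def find_cdi_specs(dirs=DEFAULT_CDI_DIRS):
    """Return the CDI spec files found in the given directories, sorted by path."""
    found = []
    for d in dirs:
        root = Path(d)
        if not root.is_dir():
            continue  # a missing directory simply contributes no specs
        for entry in root.iterdir():
            if entry.is_file() and entry.suffix in SPEC_SUFFIXES:
                found.append(str(entry))
    return sorted(found)


if __name__ == "__main__":
    specs = find_cdi_specs()
    if specs:
        for path in specs:
            print("CDI spec:", path)
    else:
        print("No CDI spec files found; nvidia.com/gpu=all likely cannot be resolved")
```

If this prints no spec files, the spec probably needs to be generated (or regenerated) before the container can request nvidia.com/gpu=all.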
The failure chain: CDI device injection fails at container startup → the shared LPDDR5X pool is exhausted about 7 minutes later under vLLM load → the driver logs two allocation failures across 20:12:10–20:12:11 → shutdown.
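To double-check the ordering and the gaps in that chain, here is a small sketch that sorts the three timestamps from your excerpt and computes the seconds between them. It assumes all three events fall on the same day; the event labels are my own shorthand.

```python
from datetime import datetime

# Timestamps copied from the three signals above (same day assumed).
EVENTS = [
    ("CDI device injection failed", "20:05:33"),
    ("first NV_ERR_NO_MEMORY", "20:12:10"),
    ("cable removal", "20:02:52"),
]


def build_timeline(events):
    """Sort events by time and attach the seconds elapsed since the previous event."""
    parsed = sorted((datetime.strptime(ts, "%H:%M:%S"), name) for name, ts in events)
    timeline = []
    previous = None
    for when, name in parsed:
        gap = 0 if previous is None else int((when - previous).total_seconds())
        timeline.append((when.strftime("%H:%M:%S"), name, gap))
        previous = when
    return timeline


if __name__ == "__main__":
    for ts, name, gap in build_timeline(EVENTS):
        print(f"{ts}  +{gap:>4}s  {name}")
```

The gap between the CDI failure and the first allocation failure comes out to 397 seconds, which is where the "about 7 minutes" figure comes from.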
This pattern matches a broader failure class on GB10 when unified memory is exhausted. See: System crashes when memory is full
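If you want to watch for this before the next crash, one option is to poll /proc/meminfo while vLLM is loading. This is only a sketch: it assumes that MemAvailable on the host is a reasonable proxy for pressure on the shared pool, which I have not verified on GB10, and the 5% warning threshold is arbitrary.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> number (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of total memory the kernel reports as available."""
    return fields["MemAvailable"] / fields["MemTotal"]


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    frac = available_fraction(info)
    print(f"MemAvailable: {frac:.1%} of MemTotal")
    if frac < 0.05:  # arbitrary threshold, chosen only for illustration
        print("Warning: memory nearly exhausted")
```

Running this in a loop during model load should show whether available memory drops toward zero in the minutes before the NVRM errors appear.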