Containerized illegal memory access requires restart of entire node

For a university project I have access to a containerized Kubernetes cluster using Run:ai, with 4 nodes and 8 B200 GPUs per node. Most of the workloads running on the cluster are AI workloads, but we are working on a heterogeneous compute workload written in CUDA/C++ with Thrust. This means that if we mess up one iterator, or if an integer overflow occurs, we end up with an illegal memory access. On our own machines this is not an issue; we simply debug it with cuda-gdb to find the culprit. On these machines, however, it apparently requires a complete restart of the node. This was confusing to me, as the documentation on the relevant Xid error 31 (Analyzing Xid Errors with the Xid Catalog — XID Errors) states that only the application has to be restarted. I talked to the maintainers of the system, and they say that this behavior is inherent to the GPU Operator device plugin. They also said that the cluster is based on the NVIDIA Reference Architecture, so it is configured as recommended by NVIDIA.
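For illustration, here is a minimal sketch (hypothetical kernel and names, not our actual code) of the kind of off-by-one bug we mean, together with the host-side error check we rely on locally. On our own workstations a fault like this makes the process exit with `cudaErrorIllegalAddress`, and we rerun it under cuda-gdb or compute-sanitizer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the "<=" lets the last thread touch one element
// past the end of the buffer. compute-sanitizer flags this reliably;
// whether the hardware itself faults (Xid 31) can depend on allocation
// padding, but larger overruns of this kind do fault.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {              // BUG: should be i < n
        out[i] = 2.0f * in[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(in, out, n);

    // The illegal access surfaces on the next synchronizing call.
    // Per the Xid 31 documentation, only this process should need to
    // be restarted after such a fault.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On a bare-metal machine, recovering from this is a matter of restarting the process (and inspecting it with cuda-gdb or compute-sanitizer), which is exactly why the node-level restart on the cluster surprised us.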

It seems rather strange to me that an illegal memory access on a single GPU of a node would require the entire node to be restarted, disrupting every other workload running on that node. It is practically impossible for us to guarantee 100% that our tasks are completely free of illegal memory accesses. Is there really no way to configure such a system to be more robust against them?
