Containerized illegal memory access requires restart of entire node

For a university project I have access to a containerized Kubernetes cluster using Run:ai, with 4 nodes and 8 B200 GPUs per node. Most of the workloads running on the cluster are AI workloads, but we are working on a heterogeneous compute workload written in CUDA/C++ with Thrust. This means that if we mess up one iterator, or if an integer overflow occurs, we end up with an illegal memory access. On our own machines this is not an issue; we simply debug it with cuda-gdb to find the culprit. On these machines, however, it apparently requires a complete restart of the node. This was confusing to me, as the documentation on the relevant Xid error 31 (Analyzing Xid Errors with the Xid Catalog — XID Errors) states that only the application has to be restarted. I talked to the maintainers of the system, and they say that this behavior is inherent to the GPU Operator device plugin. They also said that the cluster is based on the NVIDIA Reference Architecture, so it is configured as recommended by NVIDIA.
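For illustration, here is a minimal sketch (hypothetical kernel and names, not our actual code) of the kind of off-by-one bug we mean, together with the host-side error check we rely on locally. On our own workstations a fault like this makes the process exit with `cudaErrorIllegalAddress`, and we rerun it under cuda-gdb or compute-sanitizer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the "<=" lets the last thread touch one element
// past the end of the buffer. compute-sanitizer flags this reliably;
// whether the hardware itself faults (Xid 31) can depend on allocation
// padding, but larger overruns of this kind do fault.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {              // BUG: should be i < n
        out[i] = 2.0f * in[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(in, out, n);

    // The illegal access surfaces on the next synchronizing call.
    // Per the Xid 31 documentation, only this process should need to
    // be restarted after such a fault.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On a bare-metal machine, recovering from this is a matter of restarting the process (and inspecting it with cuda-gdb or compute-sanitizer), which is exactly why the node-level restart on the cluster surprised us.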

It seems rather strange to me that an illegal memory access on a single GPU of a node would require the entire node to be restarted, disrupting every other workload running on that node. It is practically impossible for us to guarantee 100% that our tasks are completely free of illegal memory accesses. Is there really no way to configure such a system to be more robust against them?
