Hi everyone,
I am investigating the error containment behavior of NVIDIA GPUs (specifically A100/H100 architectures) as outlined in the NVIDIA GPU Memory Error Management documentation.
The documentation states that certain uncorrectable ECC errors (e.g., a Double-Bit ECC Error, or DBE) can be “contained” to the specific application that owns the corrupted GPU memory, preventing a full GPU reset or impact on unrelated processes.
I am looking for clarification on how this containment works when CUDA IPC is involved. Consider the following scenario:
- Process A (Allocator): Allocates a large pool of GPU memory using `cudaMalloc` and shares handles via `cudaIpcGetMemHandle`.
- Process B (Consumer): Opens the handle via `cudaIpcOpenMemHandle` and performs heavy compute/read/write operations.
- The Event: A DBE occurs on a memory page that Process B is currently accessing.
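For concreteness, here is roughly the structure I have in mind (a minimal sketch; error checking is abbreviated and the transport of the handle between processes is not shown):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// --- Process A (Allocator) ---
void allocator() {
    void* pool = nullptr;
    cudaMalloc(&pool, 1ull << 30);        // 1 GiB pool
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, pool);
    // ... send `handle` to Process B over a socket / shared memory ...
    // Process A then goes idle; it owns the allocation but is not
    // actively touching the page when the DBE occurs.
}

// --- Process B (Consumer) ---
void consumer(const cudaIpcMemHandle_t& handle) {
    void* mapped = nullptr;
    cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    // ... launch kernels that read/write `mapped` ...
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        // If this is an uncorrectable ECC error, B's context is
        // presumably poisoned from here on. The question is what
        // state Process A's context is in at this point.
        fprintf(stderr, "sync failed: %s\n", cudaGetErrorString(err));
    }
    cudaIpcCloseMemHandle(mapped);
}
```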
My Questions:

- Context Poisoning: If Process B triggers the uncorrectable error and its CUDA context is “poisoned” (e.g., subsequent API calls return `cudaErrorECCUncorrectable`), is Process A’s context also invalidated? Process A holds the primary allocation but wasn’t the one actively accessing the corrupted bits at that moment; does it remain functional as long as it doesn’t touch the corrupted memory?
- Error Visibility: Does the Allocator (Process A) have a programmatic way to identify which specific memory pages/offsets were corrupted? In a CPU shared-memory environment, one might use `mmap` with `madvise`, or handle SIGBUS signals, to deal with hardware poison; is there a CUDA equivalent that maps a DBE event back to a specific IPC handle or virtual address?
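On the second question, the closest mechanism I’ve found so far is NVML’s device-level telemetry, e.g. `nvmlDeviceGetRemappedRows` (row remapping replaced page retirement on A100/H100). But as far as I can tell it reports counts for the whole GPU, not offsets within a particular allocation or IPC handle, which is exactly the gap I’m asking about. A sketch of what I mean:

```c
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    unsigned int corr = 0, unc = 0, pending = 0, failed = 0;

    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    /* On A100/H100, uncorrectable ECC errors trigger row remapping.
     * This query returns device-wide counts only -- nothing here ties
     * a remapped row back to a CUDA virtual address or IPC handle. */
    if (nvmlDeviceGetRemappedRows(dev, &corr, &unc, &pending, &failed)
            == NVML_SUCCESS) {
        printf("remapped rows: %u correctable, %u uncorrectable "
               "(remap pending reset: %u, remap failure: %u)\n",
               corr, unc, pending, failed);
    }

    nvmlShutdown();
    return 0;
}
```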
Any insights, or references on how the driver manages page-level retirement and containment for IPC-shared memory, would be much appreciated. Thank you.
Sincerely,
Geon-Woo