ECC Error Containment Behavior with CUDA IPC

Hi everyone,

I am investigating the error containment behavior of NVIDIA GPUs (specifically A100/H100 architectures) as outlined in the NVIDIA GPU Memory Error Management documentation.

The documentation states that certain uncorrectable ECC errors (e.g., a double-bit ECC error, or DBE) can be “contained” to the specific application that owns the corrupted GPU memory, preventing a full GPU reset or impact on unrelated processes.

I am looking for clarification on how this containment works when CUDA IPC is involved. Consider the following scenario:

  1. Process A (Allocator): Allocates a large pool of GPU memory using cudaMalloc and shares handles via cudaIpcGetMemHandle.

  2. Process B (Consumer): Opens the handle via cudaIpcOpenMemHandle and performs heavy compute/read/write operations.

  3. The Event: A DBE occurs on a memory page that Process B is currently accessing.
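For concreteness, the sharing pattern in steps 1–2 looks roughly like the sketch below (handle transport between the two processes, e.g. over a UNIX socket, and all error checking are elided; the pool size is arbitrary):

```cpp
#include <cuda_runtime.h>

int main() {
    // --- Process A (allocator) ---
    void *d_pool = nullptr;
    size_t bytes = 1ull << 30;               // e.g. a 1 GiB pool
    cudaMalloc(&d_pool, bytes);

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_pool);    // opaque handle for d_pool
    // ... ship `handle` to Process B (socket, shared file, etc.) ...

    // --- Process B (consumer), after receiving `handle` ---
    // void *d_mapped = nullptr;
    // cudaIpcOpenMemHandle(&d_mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    // ... launch kernels that read/write d_mapped; a DBE here is the
    //     scenario in question ...
    // cudaIpcCloseMemHandle(d_mapped);

    cudaFree(d_pool);
    return 0;
}
```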

My Questions:

  • Context Poisoning: If Process B triggers the uncorrectable error and its CUDA context is “poisoned” (e.g., subsequent API calls return a sticky error such as cudaErrorECCUncorrectable), is Process A’s context also invalidated? Since Process A holds the primary allocation but wasn’t actively accessing the corrupted bits at that moment, does it remain functional as long as it never touches the corrupted region?

  • Error Visibility: Does the Allocator (Process A) have a programmatic way to identify which specific memory pages/offsets were corrupted? In a CPU shared-memory environment, one might handle hardware poison by catching SIGBUS (with a BUS_MCEERR_* code) or via madvise(MADV_HWPOISON)-style mechanisms; is there a CUDA equivalent to map a DBE event back to a specific IPC handle or virtual address?
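On the visibility question, the closest thing I have found so far is polling NVML from the allocator. As far as I can tell, nvmlDeviceGetRetiredPages (pre-Ampere) and nvmlDeviceGetRemappedRows (A100/H100) report device-level addresses or counts, not anything tied back to a CUDA virtual address or IPC handle — which is exactly the gap I am asking about. A minimal sketch, assuming NVML is available:

```cpp
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Pre-Ampere: list of retired pages (physical frame addresses, not CUDA VAs).
    // Passing a null buffer queries the count only.
    unsigned int count = 0;
    nvmlDeviceGetRetiredPages(dev, NVML_PAGE_RETIREMENT_CAUSE_DOUBLE_BIT_ECC_ERROR,
                              &count, nullptr);
    printf("pages retired due to DBE: %u\n", count);

    // A100/H100: row remapping replaces page retirement.
    unsigned int corr = 0, unc = 0, pending = 0, failed = 0;
    nvmlDeviceGetRemappedRows(dev, &corr, &unc, &pending, &failed);
    printf("rows remapped: %u correctable, %u uncorrectable (remap pending=%u)\n",
           corr, unc, pending);

    nvmlShutdown();
    return 0;
}
```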

Any insights or references to how the driver manages page-level retirement and containment for IPC-shared memory would be very much appreciated. Thank you.

Sincerely,
Geon-Woo
