Question about Fault Handling in GPU Driver (especially UVM)

Hi folks, I’m currently studying how GPU faults are handled and I’m trying to understand whether there is a practical way to trigger a UVM non-replayable fault.

As I understand it, UVM categorizes faults into replayable and non-replayable. Roughly speaking, faults coming from the Graphics Engine (SM) are replayable, while faults coming from the Copy Engine or PBDMA are non-replayable. So far, the only detailed explanation I’ve found is in the comments inside kernel-open/nvidia-uvm/uvm_gpu_non_replayable_faults.c (if there is any official documentation elsewhere, I’d really appreciate pointers).

The comment gives an example:
“An example of a Copy Engine non-replayable fault is a memory copy between two virtual addresses on a GPU, in which either the source or destination pointers are not currently mapped to a physical address in the page tables of the GPU.”

I tried to reproduce this in two ways:

  • Using cudaMallocManaged and then calling cuMemAdvise to set the preferred location of the destination pages to the CPU. This does not guarantee that the physical pages on the GPU have been evicted.
  • Using the VMM API (cuMemCreate etc.) to create a valid GPU VA range without backing it with physical memory. This should guarantee that no physical page is mapped.
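
For reference, the second bullet above could be sketched roughly as follows. This is a minimal illustration, not a confirmed reproducer: it assumes the CUDA driver API, a single GPU at ordinal 0, and one possible reading of "a VA range without physical backing" (reserve the range with cuMemAddressReserve and never call cuMemCreate or cuMemMap). The CHECK macro, the round_up helper, and the 1 MiB size are illustrative names and values, and the build line is an assumption.

```cuda
// Sketch only: device-to-device copy into a reserved but unmapped GPU VA range.
// Assumed build line: nvcc sketch.cu -lcuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macro: abort on the first driver API error.
#define CHECK(call)                                                \
    do {                                                           \
        CUresult rc_ = (call);                                     \
        if (rc_ != CUDA_SUCCESS) {                                 \
            const char *name_ = nullptr;                           \
            cuGetErrorName(rc_, &name_);                           \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         name_ ? name_ : "unknown");               \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Round n up to the next multiple of m (m must be > 0).
static size_t round_up(size_t n, size_t m) { return (n + m - 1) / m * m; }

int main() {
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // The allocation properties are used only to query the VMM granularity;
    // no physical allocation is created from them.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t size = round_up(1 << 20, gran);

    CUdeviceptr src = 0, dst = 0;
    CHECK(cuMemAlloc(&src, size));                    // ordinary backed source
    CHECK(cuMemAddressReserve(&dst, size, 0, 0, 0));  // VA only, never mapped

    // Issue the copy and report whatever the driver returns; the outcome is
    // deliberately not assumed here.
    CUresult rc = cuMemcpyDtoD(dst, src, size);
    if (rc == CUDA_SUCCESS) rc = cuCtxSynchronize();
    const char *name = nullptr;
    cuGetErrorName(rc, &name);
    std::printf("copy result: %s\n", name ? name : "unknown");

    // Best-effort cleanup; these calls may fail if the context is in error.
    cuMemFree(src);
    cuMemAddressFree(dst, size);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

Whether the driver rejects the unmapped destination on the host side before the Copy Engine ever sees it is exactly the open question, so the sketch prints the returned error name rather than assuming a result.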

But neither of these attempts triggered a non-replayable fault. I monitored schedule_non_replayable_faults_handler in kernel-open/nvidia-uvm/uvm_gpu_isr.c, and it never returned 1 (which would mean a handler was scheduled). Instead, with the first approach, I only got replayable faults, because UVM was trying to migrate the pages from the CPU to the GPU. With the second approach, I only got a segmentation fault on the CPU side :(

Before I keep digging, I wanted to ask:
Has anyone successfully triggered a UVM non-replayable fault, or does anyone have insight into the conditions that reliably cause one?
Any suggestions or thoughts would be greatly appreciated!