Question about Fault Handling in GPU Driver (especially UVM)

Ryan_Liu · November 25, 2025, 3:40pm

Hi folks, I’m currently studying how GPU faults are handled and I’m trying to understand whether there is a practical way to trigger a UVM non-replayable fault.

As I understand it, UVM categorizes faults into replayable and non-replayable. Roughly speaking, faults coming from the Graphics Engine (SM) are replayable, while faults coming from the Copy Engine or PBDMA are non-replayable. So far, the only detailed explanation I’ve found is in the comments inside kernel-open/nvidia-uvm/uvm_gpu_non_replayable_faults.c (if there is any official documentation elsewhere, I’d really appreciate pointers).

The comment gives an example:
“An example of a Copy Engine non-replayable fault is a memory copy between two virtual addresses on a GPU, in which either the source or destination pointers are not currently mapped to a physical address in the page tables of the GPU.”

I tried to reproduce this in two ways:

1. Using cudaMallocManaged and then applying cuMemAdvise to make the CPU the preferred location for the destination pages. This approach does not guarantee that the physical pages on the GPU have been evicted.
2. Using the VMM API (cuMemCreate etc.) to create a valid GPU VA range without backing it with physical memory. This approach should guarantee that nothing is mapped behind the destination.

But neither attempt triggered a non-replayable fault. I monitored schedule_non_replayable_faults_handler in kernel-open/nvidia-uvm/uvm_gpu_isr.c and it never returned 1 (which would mean a handler was scheduled). Instead, with the first approach I only got replayable faults, because UVM tries to migrate the pages from the CPU to the GPU. With the second approach I only got a segmentation fault on the CPU side :(

Before I keep digging, I wanted to ask:
Has anyone successfully triggered a UVM non-replayable fault, or has insights into conditions that reliably cause one?
Any suggestions or thoughts would be greatly appreciated!
