The implementation doesn’t appear to use managed memory, which is where page faults might occur:
I try to re-implement CPU offloading in a fully transparent way: we offload the tensor to the CPU, and let the GPU directly view it as a GPU tensor. It depends on UVA technology (no clear documentation, but there are some public discussions), and per my discussion with NVIDIA experts, it works for systems with pinned memory.
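To make the distinction concrete, here is a minimal sketch of the pinned, mapped ("zero-copy") pattern that the description above seems to refer to. It is not taken from the implementation in question; the names `scale` and `zero_copy_demo` are hypothetical, and I'm assuming a 64-bit system where UVA is available. The buffer is allocated with `cudaHostAlloc` rather than `cudaMallocManaged`, so it is page-locked and never migrates, which is why I wouldn't expect GPU page faults from this kind of access.

```cuda
#include <cuda_runtime.h>

// Scales a buffer in place. The kernel cannot tell that `data` refers to
// pinned host memory rather than device memory.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns true if the GPU updated the host-resident
// buffer correctly, false on any CUDA error or wrong result.
bool zero_copy_demo(int n) {
    float* host_ptr = nullptr;
    // Page-locked (pinned) host allocation, mapped into the device address
    // space. This is NOT managed memory: the pages stay in host RAM.
    if (cudaHostAlloc(reinterpret_cast<void**>(&host_ptr), n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) host_ptr[i] = 1.0f;

    // Ask the runtime for the device-side pointer that aliases the host buffer.
    float* dev_view = nullptr;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&dev_view), host_ptr,
                                 0) != cudaSuccess) {
        cudaFreeHost(host_ptr);
        return false;
    }

    // The kernel reads and writes host memory directly over the bus.
    scale<<<(n + 255) / 256, 256>>>(dev_view, n, 2.0f);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    // After synchronization, the host sees the GPU's writes without any memcpy.
    for (int i = 0; ok && i < n; ++i) ok = (host_ptr[i] == 2.0f);

    cudaFreeHost(host_ptr);
    return ok;
}
```

Compiled with `nvcc` and called as, say, `zero_copy_demo(1 << 20)` from `main`, this should return `true` on a system with a working CUDA device. Every access goes across the interconnect, so it trades bandwidth for not having to keep the data resident on the GPU.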
I don’t have any info on the UVM GPU1 BH process, but it doesn’t appear to be unique to anything you’ve mentioned.