According to Confidential Compute on NVIDIA Hopper H100 → Running a Confidential Compute Application on the GPU → Developer Considerations, memory allocated by cudaMallocHost is handled by UVM, like cudaMallocManaged.
When going from a CPU buffer to a pinned GPU buffer, the CUDA UMD will encrypt the data, stage it into a bounce buffer, and have the GPU decrypt it and pull it into the TCB.
UVM will only be triggered if the pointer was allocated via cudaHostAlloc/cudaMallocHost/cudaMallocManaged, in which case the UVM driver performs the encrypt + bounce + DMA + decrypt sequence.
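For illustration, here is a minimal sketch of that path using the pinned-allocation APIs named above (the buffer size and the omitted error checking are my own simplifications; under CC the encrypt/bounce/decrypt steps happen transparently inside the driver, so the application code looks the same as on a non-CC system):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 1 << 20;  // 1 MiB, arbitrary

    float *h_buf = nullptr;
    float *d_buf = nullptr;

    cudaMallocHost(&h_buf, nbytes);  // pinned host allocation, UVM-handled under CC
    cudaMalloc(&d_buf, nbytes);      // device allocation inside the TCB

    // H2D copy: under CC the driver encrypts into a bounce buffer in
    // unprotected memory; the GPU DMAs it in and decrypts it inside the TCB.
    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```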
By the way, may I ask @rnertney: is there a large performance difference between using the CUDA UMD path and using UVM? That would suggest whether app developers should use malloc or cudaHostAlloc for CPU-side memory allocation for DMA under CC.
I do not have access to H100 hardware, so I cannot evaluate it myself.
UVM handles memory automatically and performs migrations based on page faults. A carefully coded application with manual data movement (i.e., without UVM) can outperform UVM in more complicated data-movement scenarios. However, UVM is very powerful and makes code much easier to write. Take a look at the intro here: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
Speed comparisons against standard, explicit data movement will vary with the complexity of your movement flows; UVM is pretty close to ideal in many use cases.
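To make the UVM style from that intro concrete, here is a minimal managed-memory sketch (the kernel name `scale`, the array size, and the launch configuration are illustrative, not from this thread): a single pointer is used on both CPU and GPU, and pages migrate on fault instead of via explicit cudaMemcpy.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: scales an array in place on the GPU.
__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One pointer, valid on both CPU and GPU; no explicit cudaMemcpy needed.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // pages resident on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // pages fault and migrate to the GPU
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // CPU access migrates pages back
    cudaFree(x);
    return 0;
}
```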
malloc + cudaHostRegister() is a forbidden API combination because the GPU's DMA engines cannot directly access VM memory once the CPU has isolated it. In HCC mode, all memory being moved into the GPU needs to be allocated through an API with a cu/cuda prefix so that our driver can intercept and encrypt it.
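A minimal sketch of the contrast (the exact error cudaHostRegister returns under HCC is not specified in this thread, so the failure check below is an assumption; the buffer size is arbitrary):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t nbytes = 1 << 20;

    // Forbidden under HCC: the driver never saw this allocation, so it cannot
    // intercept and encrypt transfers, and the GPU cannot DMA into memory the
    // CPU has isolated.
    void *p = malloc(nbytes);
    cudaError_t err = cudaHostRegister(p, nbytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        // Assumed failure mode under HCC; verify the actual error on real hardware.
        printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
    } else {
        cudaHostUnregister(p);  // would succeed on a non-CC system
    }
    free(p);

    // Supported: allocate through the CUDA API so the driver owns the mapping
    // and can route transfers through the encrypted bounce-buffer path.
    void *h = nullptr;
    cudaMallocHost(&h, nbytes);
    cudaFreeHost(h);
    return 0;
}
```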