[Question] NSys CUDA Profiler - Page fault size

I am trying to create a trace of CPU and GPU page faults under unified memory. I ran nsys using a command from a previous question:

nsys profile --force-overwrite=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --export=sqlite ./add_vectors

I then queried the SQLite database for a trace of both CPU and GPU page faults. However, the CPU page faults appear to include only demand accesses, not prefetches. The example below shows the initialization of two vectors. The addresses of the odd-numbered accesses advance by 4KB, then 8KB, and so on up to 2MB (the even-numbered accesses follow the same pattern):

Is there any way to get a log of all page faults, including prefetches?


I do not believe so, but I am going to loop in the engineer who developed the page fault trace.

@skottapalli can you comment?

Nsight Systems gets the Unified Memory CPU page fault events from CUPTI. See the CUpti_ActivityUnifiedMemoryCounter2 section (6.95) of the CUPTI 12.6 documentation.
From what I understand, prefetch operations are user-provided hints and are not considered page faults, so they will not show up as page faults.

I think you may want to see the DtoH transfers for prefetches. If so, please take a look at the “DtoH transfer” timeline row under the “Managed Memory” or “Unified Memory” timeline row. See the second screenshot in the Nsight Systems 2024.5 User Guide.
You should see DtoH transfer events with the migration cause listed as prefetch.

Thanks for the links! In the code snippet I posted, I am allocating two vectors and then initializing them. I am looking for the page faults on the CPU that create a page table entry. More specifically, if you look at the CPU UM fault trace, the page offsets don’t necessarily increase by 4KB, indicating that some pages are being “pre-created”. I used the word prefetch before because the behavior seems similar to the tree-based prefetching scheme used in CPU/GPU UM transfers.
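To make the stride observation concrete, here is a minimal sketch of how the gaps between consecutive CPU fault addresses could be computed from the exported database. The table name `CUDA_UM_CPU_PAGE_FAULT_EVENTS` and the `start` and `address` columns are assumptions about the export schema and should be checked against your own file (for example with `.schema` in the sqlite3 shell); the 4KB page size is also an assumption.

```python
import sqlite3

PAGE_SIZE = 4096  # assumed 4KB base page size


def load_fault_addresses(conn, table="CUDA_UM_CPU_PAGE_FAULT_EVENTS",
                         time_col="start", addr_col="address"):
    """Return fault addresses ordered by timestamp.

    The default table and column names are assumptions; verify them
    against the schema of your own export before relying on them.
    """
    rows = conn.execute(
        f'SELECT "{addr_col}" FROM "{table}" ORDER BY "{time_col}"'
    ).fetchall()
    return [row[0] for row in rows]


def strides_in_pages(addresses, page_size=PAGE_SIZE):
    """Gap between consecutive fault addresses, measured in whole pages."""
    pages = [addr // page_size for addr in addresses]
    return [nxt - cur for cur, nxt in zip(pages, pages[1:])]
```

A stride larger than one page between consecutive faults in the same allocation would be consistent with the intervening pages having been populated without faulting, but it does not prove it. Since the trace interleaves two vectors, the addresses would need to be split per allocation before computing strides.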

My question is: is there a way of knowing how many pages on the CPU are being initialized when there is a page fault and the page does not exist anywhere yet? Looking at the CUPTI documentation link, I don’t see “size” listed as a public member. Do you know of any other way I can get this information?

I don’t know of a way to get the size of the initialization when a page does not exist yet. However, for subsequent page faults, you can find the corresponding DtoH transfer event, which contains the size of the transfer.

See the attached screenshot. The first 3 CPU page faults are due to initialization. The next 3 page faults occur when the CPU needs to access pages that reside on the GPU, so a DtoH transfer takes place. Those events have the size information.
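As a rough illustration of pairing a CPU page fault with its DtoH transfer, the sketch below matches a fault to the first transfer that starts at or after the fault and whose address range covers the faulting address. The tuple layouts are hypothetical, not the export schema, and the matching rule is a heuristic rather than something Nsight Systems guarantees.

```python
from bisect import bisect_left


def match_fault_to_transfer(fault, transfers):
    """Return the first transfer at or after the fault that covers its address.

    fault:     (time_ns, address)                 -- hypothetical layout
    transfers: [(start_ns, address, size_bytes)]  -- sorted by start_ns
    Returns the matching transfer tuple, or None when no transfer covers
    the address (for example, a first-touch initialization fault).
    """
    fault_time, fault_addr = fault
    starts = [t[0] for t in transfers]
    first = bisect_left(starts, fault_time)  # skip transfers before the fault
    for start, base, size in transfers[first:]:
        if base <= fault_addr < base + size:
            return (start, base, size)
    return None
```

The third element of the returned tuple is the transfer size, which stands in for the size that the fault event itself does not carry. Faults that return None would correspond to the initialization case, where no size is available.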