Where to find CPU/GPU page faults when using nsys?

I’m profiling my application with the following command on a GH200. However, when I view the generated nsys report, I cannot find the page fault statistics. Where might I find them?

nsys profile --cpu-core-metrics=0,2,14 --gpu-metrics-device=all --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --event-sample=system-wide

Thanks!

After you did the analysis, did you run the stats scripts to get that information?

User Guide — nsight-systems 2025.2 documentation (direct link to info about the CPU/GPU page fault scripts)

I had not run the script before, but this is what I see when I run it:

python /home1/apps/nvidia/Linux_aarch64/24.9/profilers/Nsight_Systems/target-linux-sbsa-armv8/reports/um_cpu_page_faults_sum.py  report1.sqlite
report1.sqlite does not contain CUDA Unified Memory CPU page faults data.

For the data collection, I am running nsys on the GH host while the application runs inside a Singularity (Apptainer) container. nsys does not seem to capture the page fault information when run outside the container. Are there any specific settings needed for nsys to monitor page faults caused by an application running inside a container?

So you are using Nsight Systems outside of the container to analyze things inside the container?

It is far better to use the Nsight Systems CLI inside the container. See User Guide — nsight-systems 2025.2 documentation

@liuyis I used your advice from the other thread to get the CUDA runtime events by attaching nsys to the background process, but I am still not able to see the page fault data while profiling inside the container. Could you please have a look at the nsys-rep file?

The page fault data that Nsys can collect is specifically for the CUDA Unified Memory feature. Does the application actually use CUDA UVM? Searching the CUDA API calls, I don’t see functions like cudaMallocManaged() being used.

Hi,

I have a different application mode that uses the UVA implementation of vLLM.

Nsys rep attachment

In this trace, I noticed

  1. a “UVM GPU1 BH” process
  2. cudaHostAlloc() calls that should be directly accessible by the device (sketched below)

but still no information about CPU/GPU page faults was captured in the trace. Could you please let me know what to expect here? I’m also not sure what the “UVM GPU1 BH” process refers to, since it shows high utilization.
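For reference, here is a minimal sketch of the cudaHostAlloc() pattern I mean (my own simplified illustration, not the actual vLLM code; the names and sizes are made up): pinned host memory mapped into the device address space and dereferenced directly by a kernel.

// Minimal sketch: pinned, mapped host memory accessed directly from a kernel
// (simplified illustration, not the actual vLLM code).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float *p, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;  // device reads/writes host-resident pinned memory
}

int main() {
    const size_t n = 1 << 20;
    float *h_ptr = nullptr, *d_ptr = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping host memory into the device address space

    // Pinned, mapped host allocation: the GPU can access it without an explicit cudaMemcpy.
    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    touch<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f\n", h_ptr[0]);
    cudaFreeHost(h_ptr);
    return 0;
}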

Thanks!

UVA appears to be a different CUDA feature from UVM, according to Unified Memory in CUDA 6 | NVIDIA Technical Blog and unified virtual addressing (UVA) vs. unified memory: perceived difference.

The Nsys CPU/GPU page fault trace support is for UVM. Here is an example using the sample app cuda-samples/Samples/6_Performance/UnifiedMemoryPerf at master · NVIDIA/cuda-samples · GitHub
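If it helps, roughly this kind of pattern is what produces the Unified Memory page-fault events that Nsys traces (a minimal sketch, not the sample app itself; names and sizes are made up): memory allocated with cudaMallocManaged() that is touched by the CPU, then faulted/migrated when the GPU accesses it, and faulted back when the CPU touches it again.

// Minimal UVM sketch (not the UnifiedMemoryPerf sample itself): managed memory
// touched first by the CPU, then by a kernel, then by the CPU again, which is
// the kind of access pattern that triggers CPU and GPU page faults/migrations.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // first GPU access to these pages
}

int main() {
    const size_t n = 1 << 24;
    float *data = nullptr;

    cudaMallocManaged((void **)&data, n * sizeof(float));  // Unified Memory allocation

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;         // CPU touches the pages first

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += data[i];         // CPU touches the pages again
    printf("sum = %f\n", sum);

    cudaFree(data);
    return 0;
}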

The UVM GPU1 BH thread is a kernel thread that does not belong to the target application. I’m also not very familiar with it, but I can find similar questions mentioning it, such as UVM GPU1 BH process causing 100% CPU after standby and While training with tensorflow RTX8000 with NVLINK loses with error message. "GPU has fallen off the bus.". It is probably related to the driver, so I’d suggest checking with forums like CUDA - NVIDIA Developer Forums

@liuyis thank you for your response clarifying the differences between UVA and UVM. Also, thank you for the sample. I am able to confirm that I can profile within a container.

On a final note,

From the perspective of the nsys tooling reports, some of these results seem ambiguous. For example, the following recommendation identifies the memcpy regions as pageable memory. I am not sure whether this means that no pages were migrated by the driver, or that no page faults occurred at the time of access. Could you please clarify this?

For example, the following recommendation identifies the memcpy regions as pageable memory.

This is about a specific optimization suggestion: using pinned memory instead of pageable memory allows asynchronous memcpies and overlapping of memcpy and compute. This blog post has more details, although it’s an old one: https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned_host_memory
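For illustration, here is a minimal sketch of the pattern the recommendation is suggesting (the buffer names and sizes are hypothetical): allocating the host buffer with cudaMallocHost() so that cudaMemcpyAsync() can be truly asynchronous and overlap with compute, instead of the driver staging a pageable buffer through its internal pinned buffer.

// Sketch of the pinned-memory optimization: a page-locked host buffer lets
// cudaMemcpyAsync() enqueue and return immediately, so the transfer can overlap
// with kernels on other streams. A pageable (plain malloc) buffer forces a
// staged, effectively synchronous copy instead.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;
    float *h_pinned = nullptr, *d_buf = nullptr;

    cudaMallocHost((void **)&h_pinned, bytes);  // pinned host allocation (instead of malloc)
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous host-to-device copy; kernels launched on other streams can overlap with it.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}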