How to output the UVM page fault memory address to the terminal using nsys 2024.1?

Dear community,
I replaced the CUDA memory allocation call cudaMalloc with cudaMallocManaged in c10/cuda/CUDACachingAllocator.cpp in the PyTorch open-source code and compiled it successfully. It works as expected; the PyTorch version is v1.13.0.

When training a GNN, I successfully oversubscribed the GPU memory.

When I profile the Python program with nsys as follows:
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda --cuda-memory-usage=true --show-output=true python run_unittest.py

However, the output contains no occurrences of any page faults.

Then I wrote a test Python program using my modified PyTorch:

import torch
import torch.nn as nn
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, SAGEConv, GATConv

import torch_geometric.transforms as T
from torch_geometric.logging import init_wandb, log
import torch_sparse, torch_scatter

# Build a large neural network
model = nn.Sequential(
    nn.Linear(1000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 100)
)

# Move the model to the GPU
model = model.cuda()

# Run the network repeatedly to occupy GPU memory
# while True:
for i in range(1):
    # Create a random input
    input_data = torch.randn(100, 1000).cuda()
    print("start...")
    # Run the forward pass on the GPU
    output = model(input_data)

    print("Output:", output)
    # Show the current GPU memory usage
    print("GPU memory allocated:", torch.cuda.memory_allocated() / (1024 ** 3), "GB")

It shows GPU page faults with UVM, but I still have a few questions:

  1. Why is there still no CUDA Unified Memory CPU page fault data here?
  2. How can I output the page fault memory addresses to the terminal? I know we can find them in the GUI, but we need to collect this data via scripts for analysis; being able to output it as CSV would be even better.
  3. How can I get the CUDA kernel statistics both in the terminal and in the nsys-rep? What parameters should I pass to nsys for profiling?

The “nsys profile --stats” option only exports a default set of items. So I am not surprised that it isn’t part of the default output.

@jasoncohen has this been added to the SQLite export, or is it only visible in tooltips in the GUI?

Hi @zwu065 - could you please share the report file and provide us with the output of the nvidia-smi command on the target system?


How can I share the report file? It seems .nsys-rep is not an allowed format to upload here.

I am not sure why it is not letting you upload the report file in your reply. Could you upload it to Google Drive or OneDrive and share the link here, or DM me?


When I try to upload it, I get an error like this.

https://drive.google.com/drive/folders/1u8nfsGGbOutBka97LBuLMOzjlV0gB-R0?usp=drive_link Can you access it?

I have requested access to the Google Drive link you shared. Please accept it.

Done. Please check it.

And the conclusion was?
This thread is fascinating because I basically did the same thing, except I wrote a shared-library intercept of cudaMalloc and had it call cudaMallocManaged instead. It actually let a new, large text-to-video model I wanted to try run on my 5090, even though the model is bigger than 32 GB. I have also done a trick where I call prefetch in a simple test case, and it is indeed faster than just letting pages fault in on demand (a sketch of issuing the prefetch from Python is below). I can certainly get all the GPU addresses of the model's tensors, but I want something that monitors the page faults, capturing only the addresses, so I can leverage that as I run my model many times.
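The prefetch call itself can be issued straight from Python with ctypes. This is only a sketch: it assumes the tensor's storage really was allocated with cudaMallocManaged, and that the CUDA runtime library (libcudart.so.12 here; the exact soname depends on the CUDA version your PyTorch build loads) can be opened by name.

# Sketch: prefetch a managed (cudaMallocManaged) tensor's storage to a GPU via
# cudaMemPrefetchAsync, called through ctypes. Only meaningful when the
# underlying allocation is managed memory; the runtime soname is an assumption.
import ctypes
import torch

cudart = ctypes.CDLL("libcudart.so.12")

def prefetch_to_device(tensor, device=0):
    ptr = ctypes.c_void_p(tensor.data_ptr())
    nbytes = ctypes.c_size_t(tensor.numel() * tensor.element_size())
    stream = ctypes.c_void_p(torch.cuda.current_stream(device).cuda_stream)
    # cudaError_t cudaMemPrefetchAsync(const void *devPtr, size_t count,
    #                                  int dstDevice, cudaStream_t stream)
    err = cudart.cudaMemPrefetchAsync(ptr, nbytes, ctypes.c_int(device), stream)
    if err != 0:
        raise RuntimeError(f"cudaMemPrefetchAsync returned error {err}")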

I am not sure if I responded to these questions already over messages or if it fell off the radar.

For Q1, I examined both of the reports and they show only GPU page faults. I can't say for sure why there are no CPU page faults in the profiles of your application. A CPU page fault occurs when the CPU needs data that is currently residing on the GPU. When that happens, you should see a DtoH transfer under Unified Memory in the timeline, as in the example screenshot below. Do you expect this kind of data movement to occur in your application?

For Q2: We do not have any scripts that provide statistics on GPU page faults. If you need some sort of statistics output to the terminal for GPU page faults, you could write your own summary script similar to the reports/um_cpu_page_faults_sum.py, um_sum.py, or um_total_sum.py in the target folder of the nsys installation. They use SQLite queries to create the summaries. Can you explain what exactly you would want output to the terminal regarding GPU page faults? I can file a feature request and prioritize the work accordingly.
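In the meantime, a minimal sketch of such a script might look like the following. It assumes the report has been exported to SQLite (for example with nsys export --type sqlite report.nsys-rep, or the .sqlite file that nsys stats creates) and that the schema exposes a CUDA_UM_GPU_PAGE_FAULT_EVENTS table with start, address, and numberOfPageFaults columns; those table and column names can change between versions, so please verify them against your own export first.

# Sketch only: dump UVM GPU page-fault addresses from an Nsight Systems SQLite
# export to the terminal and to a CSV file. Table/column names are assumptions
# taken from one schema version; check ".schema" in your own export.
import csv
import sqlite3
import sys

def dump_gpu_page_faults(sqlite_path, csv_path):
    con = sqlite3.connect(sqlite_path)
    try:
        rows = con.execute(
            "SELECT start, address, numberOfPageFaults "
            "FROM CUDA_UM_GPU_PAGE_FAULT_EVENTS ORDER BY start"
        ).fetchall()
    finally:
        con.close()

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start_ns", "address_hex", "num_page_faults"])
        for start, address, num_faults in rows:
            writer.writerow([start, hex(address), num_faults])
            print(start, hex(address), num_faults)

if __name__ == "__main__":
    # e.g. python dump_um_gpu_faults.py report.sqlite gpu_faults.csv
    dump_gpu_page_faults(sys.argv[1], sys.argv[2])

The same pattern should work for the CPU fault table (CUDA_UM_CPU_PAGE_FAULT_EVENTS) once those events actually appear in a report.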

For Q3, have you used the nsys profile command with the --stats=true option? It should print the CUDA kernel statistics to the terminal at the end of the run. If you already have an nsys-rep file collected, you can use the nsys stats command to generate the statistics; please see the help text for the nsys stats command.
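For example, something along these lines should print the kernel summary to the terminal, and the second form writes it to CSV as well (cuda_gpu_kern_sum is the report name used in recent releases; nsys stats --help-reports lists the names available in your installation):

nsys stats --report cuda_gpu_kern_sum report.nsys-rep
nsys stats --report cuda_gpu_kern_sum --format csv --output kernels report.nsys-rep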

Please use the --cuda-um-cpu-page-faults=true and --cuda-um-gpu-page-faults=true CLI switches, or the equivalent GUI controls, to collect the page fault data. You should get the information you are looking for there. See the example screenshots in the Nsight Systems User Guide.

Thanks, but I didn't want to involve a heavyweight tool from the Python program in which I wanted to track faults. It took me a few tries, but I did get a CUPTI fault tracker written that my Python app can enable and disable. Then again, I only figured out about three hours ago how to get a custom allocator working, so my module and all its layers will sit in one nice, large, prefetchable block of memory.
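In case it is useful to others doing the same thing: recent PyTorch releases expose a pluggable CUDA allocator, so the cudaMallocManaged routing can also be done without an LD_PRELOAD intercept or a source patch. This is only a rough sketch; the library name managed_alloc.so and its exported managed_alloc/managed_free functions are placeholders for a small wrapper around cudaMallocManaged/cudaFree that you compile yourself.

# Sketch: route PyTorch's CUDA allocations through a managed-memory allocator
# using the pluggable allocator API (torch >= 2.0). "managed_alloc.so" and its
# exported symbols are placeholders; the shared library must export C functions
# with the signatures PyTorch expects:
#   void* managed_alloc(ssize_t size, int device, cudaStream_t stream);
#   void  managed_free(void* ptr, ssize_t size, int device, cudaStream_t stream);
import torch

allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./managed_alloc.so",  # path to the compiled wrapper library (placeholder)
    "managed_alloc",       # name of the exported allocation function
    "managed_free",        # name of the exported free function
)

# Must run before any CUDA memory is allocated in the process.
torch.cuda.memory.change_current_allocator(allocator)

# From here on, GPU tensors come from cudaMallocManaged, so they can
# oversubscribe device memory and be prefetched explicitly.
x = torch.randn(100, 1000, device="cuda")

With the allocator swapped in, the whole module's storage is managed memory, so the prefetch sketch above should apply to it directly.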