How to output the UVM page fault memory address to the terminal using nsys 2024.1?

Dear community,
I replaced the CUDA memory allocation call cudaMalloc with cudaMallocManaged in c10/cuda/CUDACachingAllocator.cpp in the PyTorch open-source code and compiled it successfully. It works as expected; the PyTorch version is v1.13.0.

When training a GNN, I successfully oversubscribed the GPU memory.

When I profile the Python program with nsys as follows:
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda --cuda-memory-usage=true --show-output=true python run_unittest.py

However, the output contains no occurrences of any page faults.

Then I wrote a test Python program using my modified PyTorch:

import torch
import torch.nn as nn
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, SAGEConv, GATConv

import torch_geometric.transforms as T
from torch_geometric.logging import init_wandb, log
import torch_sparse, torch_scatter

# Build a large neural network
model = nn.Sequential(
    nn.Linear(1000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 100)
)

# Move the model to the GPU
model = model.cuda()

# Run the network repeatedly to occupy GPU memory
# while True:
for i in range(1):
    # Create a random input
    input_data = torch.randn(100, 1000).cuda()
    print("start...")
    # Run the forward pass on the GPU
    output = model(input_data)

    print("Output:", output)
    # Show the current GPU memory usage
    print("GPU memory allocated:", torch.cuda.memory_allocated() / (1024 ** 3), "GB")

It shows GPU page faults with UVM, but I still have a few questions:

  1. Why is there still no CUDA Unified Memory CPU page fault data here?
  2. How can I output the page fault memory addresses to the terminal? I know we can find them in the GUI, but we need to collect this data via scripts for analysis; being able to output it as CSV would be even better.
  3. How can I get the CUDA kernel statistics both in the terminal and in the nsys-rep? What parameters should I pass to nsys for profiling?

The “nsys profile --stats” option only exports a default set of items. So I am not surprised that it isn’t part of the default output.

@jasoncohen has this been added to the SQLite export, or is it only visible in tooltips in the GUI?

Hi @zwu065 - could you please share the report file and provide us with the output of the nvidia-smi command on the target system?


How can I share the report file? It seems .nsys-rep is not an allowed format to upload here.

I am not sure why it is not letting you upload the report file in your reply. Could you upload it to Google Drive or OneDrive and share the link here, or DM me?


When I try to upload it, I get an error like this.

https://drive.google.com/drive/folders/1u8nfsGGbOutBka97LBuLMOzjlV0gB-R0?usp=drive_link Can you access it?

I have requested access to the Google Drive link you shared. Please accept it.

Done. Please check it.

And the conclusion was?
This thread is fascinating because I basically did the same thing, except I wrote a shared-library intercept of cudaMalloc and had it call cudaMallocManaged instead. It actually let a new, large text-to-video model I wanted to try run on my 5090, even though the model is bigger than 32 GB. I have also done a trick where I call prefetch in a simple test case, and it is indeed faster than just letting pages fault in on demand (a sketch of issuing the prefetch from Python is below). I can certainly get all the GPU addresses of the model's tensors, but I want something that monitors the page faults, capturing only the addresses, so I can leverage that as I run my model many times.
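The prefetch call itself can be issued straight from Python with ctypes. This is only a sketch: it assumes the tensor's storage really was allocated with cudaMallocManaged, and that the CUDA runtime library (libcudart.so.12 here; the exact soname depends on the CUDA version your PyTorch build loads) can be opened by name.

# Sketch: prefetch a managed (cudaMallocManaged) tensor's storage to a GPU via
# cudaMemPrefetchAsync, called through ctypes. Only meaningful when the
# underlying allocation is managed memory; the runtime soname is an assumption.
import ctypes
import torch

cudart = ctypes.CDLL("libcudart.so.12")

def prefetch_to_device(tensor, device=0):
    ptr = ctypes.c_void_p(tensor.data_ptr())
    nbytes = ctypes.c_size_t(tensor.numel() * tensor.element_size())
    stream = ctypes.c_void_p(torch.cuda.current_stream(device).cuda_stream)
    # cudaError_t cudaMemPrefetchAsync(const void *devPtr, size_t count,
    #                                  int dstDevice, cudaStream_t stream)
    err = cudart.cudaMemPrefetchAsync(ptr, nbytes, ctypes.c_int(device), stream)
    if err != 0:
        raise RuntimeError(f"cudaMemPrefetchAsync returned error {err}")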

I am not sure if I responded to these questions already over messages or if it fell off the radar.

For Q1, I examined both of the reports and they show only GPU page faults. I can't say for sure why there are no CPU page faults in the profiles of your application. A CPU page fault occurs when the CPU needs data that is currently residing on the GPU. When that happens, you should see a DtoH transfer under Unified Memory in the timeline, as in the example screenshot below. Do you expect this kind of data movement to occur in your application?

For Q2: We do not have any scripts that provide statistics on GPU page faults. If you need some sort of statistics output to the terminal for GPU page faults, you could write your own summary script similar to the reports/um_cpu_page_faults_sum.py, um_sum.py, or um_total_sum.py in the target folder of the nsys installation. They use SQLite queries to create the summaries. Can you explain what exactly you would want output to the terminal regarding GPU page faults? I can file a feature request and prioritize the work accordingly.
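In the meantime, a minimal sketch of such a script might look like the following. It assumes the report has been exported to SQLite (for example with nsys export --type sqlite report.nsys-rep, or the .sqlite file that nsys stats creates) and that the schema exposes a CUDA_UM_GPU_PAGE_FAULT_EVENTS table with start, address, and numberOfPageFaults columns; those table and column names can change between versions, so please verify them against your own export first.

# Sketch only: dump UVM GPU page-fault addresses from an Nsight Systems SQLite
# export to the terminal and to a CSV file. Table/column names are assumptions
# taken from one schema version; check ".schema" in your own export.
import csv
import sqlite3
import sys

def dump_gpu_page_faults(sqlite_path, csv_path):
    con = sqlite3.connect(sqlite_path)
    try:
        rows = con.execute(
            "SELECT start, address, numberOfPageFaults "
            "FROM CUDA_UM_GPU_PAGE_FAULT_EVENTS ORDER BY start"
        ).fetchall()
    finally:
        con.close()

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start_ns", "address_hex", "num_page_faults"])
        for start, address, num_faults in rows:
            writer.writerow([start, hex(address), num_faults])
            print(start, hex(address), num_faults)

if __name__ == "__main__":
    # e.g. python dump_um_gpu_faults.py report.sqlite gpu_faults.csv
    dump_gpu_page_faults(sys.argv[1], sys.argv[2])

The same pattern should work for the CPU fault table (CUDA_UM_CPU_PAGE_FAULT_EVENTS) once those events actually appear in a report.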

For Q3, have you used the nsys profile command with the --stats=true option? It should print the CUDA kernel statistics to the terminal at the end of the run. If you already have an nsys-rep file collected, you can use the nsys stats command to generate the statistics; please see the help text for the nsys stats command.
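For example, something along these lines should print the kernel summary to the terminal, and the second form writes it to CSV as well (cuda_gpu_kern_sum is the report name used in recent releases; nsys stats --help-reports lists the names available in your installation):

nsys stats --report cuda_gpu_kern_sum report.nsys-rep
nsys stats --report cuda_gpu_kern_sum --format csv --output kernels report.nsys-rep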

Please use the --cuda-um-cpu-page-faults=true and --cuda-um-gpu-page-faults=true CLI switches, or the equivalent GUI controls, to collect the page fault data. You should get the information you are looking for there. See the example screenshots in the Nsight Systems User Guide.

Thanks, but I didn't want to involve a heavyweight tool from the Python program in which I wanted to track faults. It took me a few tries, but I did get a CUPTI fault tracker written that my Python app can enable and disable. Then again, I only figured out about three hours ago how to get a custom allocator working, so my module and all its layers will sit in one nice, large, prefetchable block of memory.
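In case it is useful to others doing the same thing: recent PyTorch releases expose a pluggable CUDA allocator, so the cudaMallocManaged routing can also be done without an LD_PRELOAD intercept or a source patch. This is only a rough sketch; the library name managed_alloc.so and its exported managed_alloc/managed_free functions are placeholders for a small wrapper around cudaMallocManaged/cudaFree that you compile yourself.

# Sketch: route PyTorch's CUDA allocations through a managed-memory allocator
# using the pluggable allocator API (torch >= 2.0). "managed_alloc.so" and its
# exported symbols are placeholders; the shared library must export C functions
# with the signatures PyTorch expects:
#   void* managed_alloc(ssize_t size, int device, cudaStream_t stream);
#   void  managed_free(void* ptr, ssize_t size, int device, cudaStream_t stream);
import torch

allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./managed_alloc.so",  # path to the compiled wrapper library (placeholder)
    "managed_alloc",       # name of the exported allocation function
    "managed_free",        # name of the exported free function
)

# Must run before any CUDA memory is allocated in the process.
torch.cuda.memory.change_current_allocator(allocator)

# From here on, GPU tensors come from cudaMallocManaged, so they can
# oversubscribe device memory and be prefetched explicitly.
x = torch.randn(100, 1000, device="cuda")

With the allocator swapped in, the whole module's storage is managed memory, so the prefetch sketch above should apply to it directly.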