Dear community,
I replaced the cudaMalloc memory allocation with cudaMallocManaged in c10/cuda/CUDACachingAllocator.cpp in the PyTorch open-source code (v1.13.0) and compiled it successfully. It works as expected.
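For reference, the change is essentially the following substitution (a simplified sketch, not the actual CUDACachingAllocator code — the real allocator wraps the raw CUDA call in its caching and retry logic):

#include <cuda_runtime.h>

// Sketch of the allocation-site change; error handling and caching omitted.
cudaError_t alloc_block(void** ptr, size_t size) {
  // Before: device-only allocation, fails once physical GPU memory is exhausted
  // return cudaMalloc(ptr, size);

  // After: managed (UVM) allocation, which allows oversubscribing GPU memory
  // by migrating pages between host and device on demand
  return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
}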
When training a GNN, I successfully oversubscribed the GPU memory.
When I used nsys to analyze a Python program:
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda --cuda-memory-usage=true --show-output=true python run_unittest.py
However, the report contained no page-fault occurrences at all.
Then I wrote a test Python program using my modified PyTorch:
import torch
import torch.nn as nn
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv, SAGEConv, GATConv
import torch_geometric.transforms as T
from torch_geometric.logging import init_wandb, log
import torch_sparse, torch_scatter

# Build a large neural network
model = nn.Sequential(
    nn.Linear(1000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 10000),
    nn.ReLU(),
    nn.Linear(10000, 100)
)

# Move the model to the GPU
model = model.cuda()

# Run the network to occupy GPU memory
# (replace the for loop with `while True:` to keep it running indefinitely)
for i in range(1):
    # Create a random input
    input_data = torch.randn(100, 1000).cuda()
    print("start...")
    # Forward pass on the GPU
    output = model(input_data)
    print("Output:", output)
    # Show current GPU memory usage
    print("GPU memory allocated:", torch.cuda.memory_allocated() / (1024 ** 3), "GB")
nsys now shows GPU page faults with UVM, but I still have a few questions:
- Why does the report still contain no CUDA Unified Memory CPU page-fault data?
- How can I output the page-fault memory addresses to the terminal? I know they can be found in the GUI, but we need to collect this data via scripts for analysis; CSV output would be even better.
- How can I get the CUDA Kernel Statistics in both the terminal and the .nsys-rep file? What parameters should I pass to nsys?
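In case it helps, this is what I have tried so far for exporting statistics (report names and flags taken from `nsys stats --help` / `nsys export --help` on my version; I am not sure they are the right ones for this purpose):

# Generate the report with the same profiling flags as above
nsys profile -o myreport --stats=true --trace=cuda \
    --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true \
    --cuda-memory-usage=true --show-output=true python run_unittest.py

# Re-derive the kernel summary from the .nsys-rep file and write it as CSV
nsys stats --report gpukernsum --format csv --output myreport_stats myreport.nsys-rep

# Export the full trace to SQLite so page-fault events can be queried from scripts
nsys export --type sqlite --output myreport.sqlite myreport.nsys-rep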



