Dear community,
I replaced the CUDA memory allocation call cudaMalloc with cudaMallocManaged in c10/cuda/CUDACachingAllocator.cpp in the PyTorch open-source code and compiled it successfully. It works as expected; the PyTorch version is v1.13.0.
When training a GNN, I successfully oversubscribed the GPU memory.
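For anyone who wants to try the same thing, the change amounts to swapping the raw device allocation for a managed one. A minimal sketch of that substitution (names simplified here; this is not the exact PyTorch allocator code) looks like this:

    // Hypothetical helper illustrating the substitution; the real change is the
    // raw cudaMalloc call inside CUDACachingAllocator.cpp.
    #include <cuda_runtime.h>

    static cudaError_t allocDeviceMemory(void** ptr, size_t size, int device) {
        // Before: cudaError_t err = cudaMalloc(ptr, size);
        cudaError_t err = cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
        if (err == cudaSuccess) {
            // Optional: hint that pages should prefer to live on this GPU,
            // so they only migrate to host memory under oversubscription.
            cudaMemAdvise(*ptr, size, cudaMemAdviseSetPreferredLocation, device);
        }
        return err;
    }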
When I use nsys to analyze a Python program, I run:

    nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda --cuda-memory-usage=true --show-output=true python run_unittest.py

However, at first there were no occurrences of any page faults at all. The report now shows GPU page faults with UVM, but I still have a few questions:
Q1: Why is there still no CUDA Unified Memory CPU page-fault data here?
Q2: How can I output the page-fault memory addresses to the terminal? I know we can find them in the GUI, but we need to collect this data via scripts for analysis; if the info could also be output as CSV, that would be even better.
Q3: How can I get the CUDA Kernel Statistics in both the terminal and the nsys-rep file? Which parameters should I use when running nsys?
I am not sure why it is not letting you upload the report file in your reply. Could you upload it to Google Drive or OneDrive and share the link here, or DM me?
And the conclusion was?
This thread is fascinating because I basically did the same thing, except I wrote a shared-library intercept of cudaMalloc and had it call cudaMallocManaged. It actually let a new, big text-to-video model I wanted to try run on my 5090, even though the model is larger than 32 GB. I've also done a trick where I call prefetch in a simple test case, and it is indeed faster than just letting pages fault in on demand. I can certainly get all the GPU addresses of the model's tensors, but I want something that monitors the page faults, captures only the addresses, and leverages that as I run my model many times.
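In case it helps anyone else, here is a minimal sketch of that kind of LD_PRELOAD intercept (my own code is more involved; the file names and build line are just examples, and this only works when the application links the CUDA runtime dynamically):

    // cudamalloc_intercept.cpp
    // Build (example): g++ -shared -fPIC cudamalloc_intercept.cpp -o libintercept.so -lcudart
    // Run (example):   LD_PRELOAD=./libintercept.so python run_model.py
    #include <cuda_runtime_api.h>

    // Our definition shadows the one in libcudart, so every cudaMalloc in the
    // process becomes a managed allocation and the GPU can be oversubscribed.
    extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
        return cudaMallocManaged(devPtr, size, cudaMemAttachGlobal);
    }

The prefetch trick is then just a matter of calling cudaMemPrefetchAsync(ptr, bytes, device, stream) on the tensor addresses before the kernels that need them run.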
I am not sure if I already responded to these questions over messages or if this fell off the radar.
For Q1, I examined both reports and they show only GPU page faults. I cannot say for sure why there are no CPU page faults in the profiles of your application. When the CPU needs data that currently resides on the GPU, a CPU page fault occurs. When that happens, you should see a DtoH transfer under Unified Memory in the timeline, as shown in the example screenshot below. Do you expect this kind of data movement to occur in your application?
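For illustration only, a tiny standalone program like the sketch below (assuming Linux, a Pascal-or-newer GPU, and compilation with nvcc) produces exactly that pattern: the kernel write causes GPU page faults, and the CPU read at the end causes CPU page faults plus the DtoH Unified Memory traffic mentioned above.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void touch(int* p, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = 1;  // first touch on the GPU: pages migrate to the device
    }

    int main() {
        const size_t n = 1 << 20;
        int* p = nullptr;
        cudaMallocManaged(&p, n * sizeof(int));
        touch<<<(n + 255) / 256, 256>>>(p, n);
        cudaDeviceSynchronize();
        // Reading on the CPU now faults and migrates pages back to the host,
        // which shows up as DtoH traffic under Unified Memory in the timeline.
        long long sum = 0;
        for (size_t i = 0; i < n; ++i) sum += p[i];
        printf("%lld\n", sum);
        cudaFree(p);
        return 0;
    }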
For Q2: We do not have any scripts that provide statistics on GPU page faults. If you need some sort of statistics output to the terminal for GPU page faults, you could write your own summary script similar to reports/um_cpu_page_faults_sum.py, um_sum.py, or um_total_sum.py in the target folder of the nsys installation; they use SQLite queries to create the summaries. Can you explain exactly what you would want output to the terminal regarding GPU page faults? I can file a feature request and prioritize the work accordingly.
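As a starting point, the same data can also be pulled straight out of the SQLite export (nsys export --type sqlite report.nsys-rep) with any SQLite client. The sketch below dumps it as CSV from C++; the table and column names (CUDA_UM_GPU_PAGE_FAULT_EVENTS, start, address) should be checked against .schema in the sqlite3 shell, since the exported schema can change between versions.

    // dump_gpu_faults.cpp
    // Build (example): g++ dump_gpu_faults.cpp -o dump_gpu_faults -lsqlite3
    #include <sqlite3.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s report.sqlite\n", argv[0]); return 1; }
        sqlite3* db = nullptr;
        if (sqlite3_open(argv[1], &db) != SQLITE_OK) { fprintf(stderr, "open failed\n"); return 1; }
        // Verify the table/column names with ".schema" before relying on this query.
        const char* sql = "SELECT start, address FROM CUDA_UM_GPU_PAGE_FAULT_EVENTS";
        sqlite3_stmt* stmt = nullptr;
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
            fprintf(stderr, "query failed: %s\n", sqlite3_errmsg(db));
            sqlite3_close(db);
            return 1;
        }
        printf("start_ns,address\n");  // CSV header
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            printf("%lld,0x%llx\n",
                   (long long)sqlite3_column_int64(stmt, 0),
                   (unsigned long long)sqlite3_column_int64(stmt, 1));
        }
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }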
For Q3: Have you used the nsys profile command with the --stats=true option? It should print the CUDA kernel statistics to the terminal at the end of the run. If you already have an nsys-rep file collected, you can use the nsys stats command to generate the statistics; please see the help text for the nsys stats command.
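For example, against an existing report, something like the following prints a kernel summary (the exact report name can vary between nsys versions; nsys stats --help lists the available reports):

    nsys stats --report cuda_gpu_kern_sum report.nsys-rep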
Please use the --cuda-um-cpu-page-faults=true and --cuda-um-gpu-page-faults=true CLI switches, or the equivalent GUI controls, to collect the page-fault data. You should get the information you are looking for there. See the example screenshots in the Nsight Systems User Guide.
Thanks, but I didn't want to involve a heavyweight tool from the Python program I wanted to track faults in. It took me a few tries, but I did get a CUPTI fault tracker written that my Python app can enable and disable. Then again, I just figured out about three hours ago how to get a custom allocator working, so my module and all its layers will sit in one nice large prefetchable block of memory.
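For anyone who wants to do something similar, here is a rough sketch of what a CUPTI-based fault tracker can look like. It is modeled on the CUPTI unified-memory activity sample rather than on my actual code; the start/stop entry points are made-up names meant to be loaded from Python via ctypes, and the struct version (CUpti_ActivityUnifiedMemoryCounter2), enum names, and include/library paths should be checked against the cupti_activity.h shipped with your CUDA toolkit.

    // fault_tracker.cpp
    // Build (example): g++ -shared -fPIC fault_tracker.cpp -o libfaulttracker.so \
    //     -I$CUDA_HOME/extras/CUPTI/include -L$CUDA_HOME/extras/CUPTI/lib64 -lcupti
    #include <cupti.h>
    #include <cstdio>
    #include <cstdlib>

    static void CUPTIAPI bufferRequested(uint8_t** buffer, size_t* size, size_t* maxNumRecords) {
        *size = 16 * 1024;
        *buffer = (uint8_t*)malloc(*size);
        *maxNumRecords = 0;  // let CUPTI fill the buffer with as many records as fit
    }

    static void CUPTIAPI bufferCompleted(CUcontext, uint32_t, uint8_t* buffer,
                                         size_t, size_t validSize) {
        CUpti_Activity* record = nullptr;
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER) {
                auto* um = (CUpti_ActivityUnifiedMemoryCounter2*)record;
                if (um->counterKind == CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_GPU_PAGE_FAULT) {
                    // Log just the faulting address (CSV-friendly).
                    printf("gpu_fault,0x%llx\n", (unsigned long long)um->address);
                }
            }
        }
        free(buffer);
    }

    // Call once from Python (via ctypes) after CUDA has been initialized.
    extern "C" void startFaultTracking() {
        CUpti_ActivityUnifiedMemoryCounterConfig config[1] = {};
        config[0].scope = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_SCOPE_PROCESS_SINGLE_DEVICE;
        config[0].kind = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_GPU_PAGE_FAULT;
        config[0].deviceId = 0;
        config[0].enable = 1;
        cuptiActivityConfigureUnifiedMemoryCounter(config, 1);
        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);
    }

    // Call to stop tracking and flush any buffered records.
    extern "C" void stopFaultTracking() {
        cuptiActivityDisable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);
        cuptiActivityFlushAll(0);
    }

Each record of that kind carries an address field, which is exactly the piece of information needed to build a prefetch schedule for later runs.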