Tracking page accesses from multiple GPUs in detail

Hello NVIDIA Developer Community,

I’m working on optimizing a multi-GPU application using Unified Virtual Memory (UVM) and need to profile detailed page access patterns. I’m looking for ways to track which GPU accesses specific memory pages throughout the entire application execution.

Environment

2x RTX 3090
CUDA Version: 12.6
Driver Version: 560.35.03
Ubuntu 22.04

What I’ve tried so far:

  • CUPTI Activity API with CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER
  • Nsight Systems with --cuda-um-gpu-page-faults=true

Specific Questions

  1. Is there a way to get page-level granularity for UVM access tracking?
  2. Can I identify which specific GPU is accessing which memory page at any given time?
  3. Are there any tools or APIs that provide real-time page migration monitoring (except for what I’ve tried)?
  4. What’s the best approach for tracking UVM page ownership changes in multi-GPU scenarios?
  5. Is it possible to simply pass the device ID from the host as a kernel parameter and print it from inside the kernel?

Any guidance on tools, APIs, or methodologies would be greatly appreciated!

This is beyond what Nsight Systems offers.

@mstrengert does Nsight Compute have anything that can help or do you have another suggestion?

May I ask why you think you need such low level details for your optimization?

Thanks for your reply.

I’ve been looking for this because I want to know how much data is shared across GPUs over the application’s lifecycle.

Also, I need to know whether a page evicted to the host was used by other GPUs or not.
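
A minimal sketch of that analysis, assuming the profiler output has already been reduced to simple tuples. The record layout `(timestamp, kind, device_id, address, size)`, the `"fault"` / `"evict"` kinds, and the 4096-byte `PAGE_SIZE` are all hypothetical simplifications for illustration, not a CUPTI or Nsight Systems format:

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumption: tracking granularity in bytes, adjust as needed


def pages_in_range(address, size, page_size=PAGE_SIZE):
    """Return the page indices covered by the byte range [address, address + size)."""
    first = address // page_size
    last = (address + size - 1) // page_size
    return range(first, last + 1)


def sharing_summary(records, page_size=PAGE_SIZE):
    """Summarize cross-GPU sharing from hypothetical trace records.

    Each record is (timestamp, kind, device_id, address, size), where kind is
    "fault" (device_id touched the range) or "evict" (the range was moved from
    device_id to the host).

    Returns (shared_pages, evictions):
      shared_pages: set of pages touched by more than one GPU
      evictions: list of (timestamp, page, evicting_gpu, other_gpus), where
                 other_gpus are the GPUs that touched the page before eviction
    """
    touched_by = defaultdict(set)  # page -> GPUs seen touching it so far
    evictions = []
    for ts, kind, dev, addr, size in sorted(records, key=lambda r: r[0]):
        for page in pages_in_range(addr, size, page_size):
            if kind == "fault":
                touched_by[page].add(dev)
            elif kind == "evict":
                others = sorted(touched_by[page] - {dev})
                evictions.append((ts, page, dev, others))
    shared_pages = {p for p, devs in touched_by.items() if len(devs) > 1}
    return shared_pages, evictions
```

The size of `shared_pages` times the page size gives a rough figure for how much data was touched by more than one GPU, and a non-empty `other_gpus` entry marks an evicted page that another GPU had used earlier.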

I’m wondering how much of this you could get through our recipe system (User Guide — nsight-systems 2025.3 documentation) (this is a direct link to the section on recipes, the forum software just munges link titles).

Specifically, I am thinking of these two recipes:

It may very well be that we have all the data you need, you just need to use the recipes (or modify them, they are python scripts) to pull it out in a useful way.
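
As a starting point for that kind of digging, here is a minimal sketch that lists the tables and columns in a report exported to SQLite, assuming the export was produced with something like `nsys export --type sqlite`. It deliberately assumes no table names; the helper `list_tables_with_columns` and its `name_filter` argument are hypothetical names for this example:

```python
import sqlite3


def list_tables_with_columns(conn, name_filter=""):
    """Return {table_name: [column names]} for tables whose name contains
    name_filter (case-insensitive). An empty filter matches every table."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    result = {}
    for (name,) in tables:
        if name_filter.lower() not in name.lower():
            continue
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        result[name] = [col[1] for col in columns]
    return result
```

Calling it with `sqlite3.connect("report.sqlite")` (a placeholder file name) and a filter such as `"um"` narrows the output to tables whose names mention unified memory, which shows which fields are actually available before writing or modifying a recipe.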

I mean that we don’t show time here, but the time is in the data.

I used that option, but it doesn’t give me much more detail than what CUPTI records in CUpti_ActivityUnifiedMemoryCounter.

For example, CUPTI gives me the reason for a page migration or whether the memory access was a read or a write.

Is it possible to collect such detailed data with nsys?

Unfortunately we are using CUPTI under the covers, so yeah, you aren’t going to get a lot more.