Is there a way to profile which GPU accesses each memory page throughout the entire application execution in a multi-GPU environment? Specifically, I’m looking for:
- Page-level access tracking: Which GPU(s) access specific memory pages over time
- Migration timeline: When pages migrate between GPUs and the reasons (on-touch, access counter-based, etc.)
More specifically, I would like to:
- Track individual page access patterns from application start to finish
- Identify which GPU holds each page at different time intervals
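To make the desired output concrete, the per-page timeline could look roughly like the following minimal Python sketch. It assumes migration events are already available as (timestamp, address, destination device, cause) records. The `MigrationEvent` type, the helper names, the cause strings, and the 2 MiB page size are all hypothetical placeholders and do not correspond to the output format of any NVIDIA tool.

```python
from collections import defaultdict
from typing import NamedTuple

# Assumed tracking granularity (hypothetical); the real migration
# granularity depends on the driver and may differ.
PAGE_SIZE = 2 * 1024 * 1024


class MigrationEvent(NamedTuple):
    """Hypothetical record of one page migration."""
    timestamp_ns: int   # when the migration completed
    address: int        # virtual address inside the migrated page
    dst_device: int     # GPU index that holds the page afterwards
    cause: str          # e.g. "on_touch" or "access_counter" (illustrative labels)


def build_ownership_timeline(events, page_size=PAGE_SIZE):
    """Return {page_index: [(start_ns, device, cause), ...]} ordered by time."""
    timeline = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.timestamp_ns):
        page = ev.address // page_size
        timeline[page].append((ev.timestamp_ns, ev.dst_device, ev.cause))
    return dict(timeline)


def owner_at(timeline, page, t_ns):
    """Return the device holding `page` at time `t_ns`, or None if unknown."""
    owner = None
    for start_ns, device, _cause in timeline.get(page, []):
        if start_ns > t_ns:
            break
        owner = device
    return owner


if __name__ == "__main__":
    sample = [
        MigrationEvent(200, 0x200000, 1, "access_counter"),
        MigrationEvent(100, 0x200000, 0, "on_touch"),
    ]
    tl = build_ownership_timeline(sample)
    print(tl)
    print(owner_at(tl, 1, 150))
```

The open question is which tool or API can produce the event records that such a structure would consume, ideally with read/write accesses as well as migrations.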
Environment:
- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:01:00.0 Off | 0 |
| N/A 26C P0 47W / 300W | 11094MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:25:00.0 Off | 0 |
| N/A 30C P0 ERR! / 300W | 133MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:C1:00.0 Off | 0 |
| N/A 23C P0 41W / 300W | 139MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:E1:00.0 Off | 0 |
| N/A 28C P0 45W / 300W | 2925MiB / 81920MiB | 28% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
- lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
Is there a combination of NVIDIA profiling tools or APIs that can provide this level of page-access granularity? Any guidance on capturing and analyzing per-page GPU access patterns would be greatly appreciated.