Multi-GPU UVM Page Access Profiling: Tracking page-to-GPU access patterns throughout application lifecycle

Is there a way to profile which GPU accesses each memory page throughout the entire application execution in a multi-GPU environment? Specifically, I’m looking for:

  1. Page-level access tracking: Which GPU(s) access specific memory pages over time

  2. Migration timeline: When pages migrate between GPUs and the reasons (on-touch, access counter-based, etc.)

More specifically,

  • Track individual page access patterns from application start to finish
  • Identify which GPU holds each page at different time intervals

Environment:

  • nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:01:00.0 Off |                    0 |
| N/A   26C    P0             47W /  300W |   11094MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:25:00.0 Off |                    0 |
| N/A   30C    P0            ERR! /  300W |     133MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   23C    P0             41W /  300W |     139MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:E1:00.0 Off |                    0 |
| N/A   28C    P0             45W /  300W |    2925MiB /  81920MiB |     28%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
  • lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

Is there a combination of NVIDIA profiling tools or APIs that can provide this level of page-access granularity? Any guidance on capturing and analyzing per-page GPU access patterns would be greatly appreciated.

You might wish to ask profiler-specific questions on one of the profiler forums, e.g. here. It looks like nsys has some information available. Have you tried it?


I know nsys supports profiling page faults, but I need to see every page access from each GPU, since what I want to measure is the degree of data sharing across multiple GPUs.
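A partial workaround along the nsys route: capture Unified Memory activity (e.g. `nsys profile --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true ./app`) and then dump the raw records with `nsys export --type sqlite report.nsys-rep`. In exports I have seen, the UM counter records land in a table named `CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER` with `address`, `srcId`, and `dstId` columns, but both the table and column names here are assumptions based on my own reports — verify against your export's schema before relying on them. Note this only shows accesses that triggered faults/migrations, not every access to an already-resident page. A rough sketch that buckets those records into per-page, per-GPU counts:

```python
import sqlite3
from collections import Counter

# UVM migrates data in blocks; 64 KiB is a common granularity on x86,
# but this is an assumption -- adjust PAGE_SIZE for your platform.
PAGE_SIZE = 64 * 1024

def page_access_histogram(db_path):
    """Count UM migration records per (page, destination GPU) from an
    `nsys export --type sqlite` database.

    The table/column names are assumed from CUPTI unified-memory counter
    records as they appear in recent nsys exports; check your own schema
    (`.tables` / `.schema` in the sqlite3 shell) first.
    """
    con = sqlite3.connect(db_path)
    hist = Counter()
    for address, dst in con.execute(
        "SELECT address, dstId FROM CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER"
    ):
        # Align the faulting address down to its page base.
        page = address // PAGE_SIZE * PAGE_SIZE
        hist[(page, dst)] += 1
    con.close()
    return hist

def shared_pages(hist):
    """Pages migrated to more than one GPU -- a proxy for data sharing."""
    gpus_per_page = {}
    for (page, dst), _count in hist.items():
        gpus_per_page.setdefault(page, set()).add(dst)
    return {p: g for p, g in gpus_per_page.items() if len(g) > 1}
```

For true per-access (not per-fault) visibility you would likely need the CUPTI unified-memory counter activity API directly, or the driver's access-counter support on supported hardware; the sketch above only approximates sharing from migration events.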