Hi guys
I would like to trace every memory access (load/store) made by CUDA kernels on each device in a multi-GPU system, and obtain a time-ordered log of these accesses for all devices.
From my understanding:
- The CUPTI Activity API traces memory copies, allocations, and unified memory events, but it does not provide a way to record every individual memory access inside a kernel (the first sketch after this list shows what I am currently able to capture with it).
- The CUPTI Metric/Event APIs can give aggregate statistics (e.g., the total number of loads/stores), but not a full trace of each access.
- NVBit allows dynamic instrumentation at the instruction level, so logging every memory access seems possible. However, it does not natively record a device ID or a global time order across multiple GPUs, so extra work is needed to correlate and order the logs from different devices (honestly, I don't fully understand how NVBit works; the second sketch after this list is my rough mental model of it).
- Nsight Systems and Nsight Compute provide detailed profiling and some memory access statistics, but they do not offer a full, time-ordered trace of all memory accesses per device.
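For reference, this is roughly what I have working with the CUPTI Activity API today. It is a minimal sketch (error handling omitted; the exact memcpy record struct version, e.g. CUpti_ActivityMemcpy vs. a numbered variant, depends on the CUPTI release): it gives per-device, timestamped memcpy and kernel records, but nothing at the level of individual loads/stores inside a kernel.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

// CUPTI asks us for empty buffers and hands them back filled with records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
    *size = 16 * 1024;
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;  // 0 = fill the buffer with as many records as fit
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
            // deviceId / start / end / bytes are present in every struct version.
            auto *m = (CUpti_ActivityMemcpy *)record;
            printf("memcpy dev=%u start=%llu ns end=%llu ns bytes=%llu\n",
                   m->deviceId, (unsigned long long)m->start,
                   (unsigned long long)m->end, (unsigned long long)m->bytes);
        }
    }
    free(buffer);
}

int main() {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // ... run the multi-GPU workload here ...

    cuptiActivityFlushAll(0);
    return 0;
}
```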
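And here is my rough mental model of an NVBit tool, loosely based on the mem_trace example that ships with the NVBit SDK. It assumes the NVBit headers and build setup, skips the channel/export boilerplate of the real example (the injected printf is only for illustration; mem_trace streams records to a host thread instead), and exact API names can differ slightly between NVBit releases. The parts I added myself for the multi-GPU question, and am unsure about, are the dev_id argument captured at launch time and the %globaltimer read on the device, which would still have to be calibrated against a host clock to get a trustworthy cross-GPU ordering.

```cpp
// inject_funcs.cu -- device function injected before every memory instruction.
#include <cstdint>
#include <cstdio>

extern "C" __device__ __noinline__ void instrument_mem(int pred, uint64_t addr,
                                                       int dev_id) {
    if (!pred) return;                  // the instruction was predicated off
    uint64_t t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));  // per-GPU ns timer
    // Illustration only: mem_trace pushes a record through a channel instead.
    printf("dev=%d addr=0x%llx t=%llu\n", dev_id,
           (unsigned long long)addr, (unsigned long long)t);
}
```

```cpp
// mem_trace.cu -- host side: instrument every load/store of each launched kernel.
#include <set>
#include <vector>
#include "nvbit.h"
#include "nvbit_tool.h"

static std::set<CUfunction> already_instrumented;

static void instrument_function(CUcontext ctx, CUfunction func, int dev_id) {
    std::vector<CUfunction> funcs = nvbit_get_related_functions(ctx, func);
    funcs.push_back(func);
    for (CUfunction f : funcs) {
        if (!already_instrumented.insert(f).second) continue;  // done already
        for (Instr *instr : nvbit_get_instrs(ctx, f)) {
            if (instr->getMemorySpace() == InstrType::MemorySpace::NONE ||
                instr->getMemorySpace() == InstrType::MemorySpace::CONSTANT)
                continue;
            nvbit_insert_call(instr, "instrument_mem", IPOINT_BEFORE);
            nvbit_add_call_arg_guard_pred_val(instr);        // predicate
            nvbit_add_call_arg_mref_addr64(instr, 0);        // effective address
            nvbit_add_call_arg_const_val32(instr, dev_id);   // device of launch
        }
    }
}

void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus) {
    if ((cbid == API_CUDA_cuLaunchKernel ||
         cbid == API_CUDA_cuLaunchKernel_ptsz) && !is_exit) {
        auto *p = (cuLaunchKernel_params *)params;
        CUdevice dev;
        cuCtxGetDevice(&dev);  // GPU that owns the context of this launch
        instrument_function(ctx, p->f, (int)dev);
        nvbit_enable_instrumented(ctx, p->f, true);
    }
}
```

Is this the right way to think about it, and is calibrating %globaltimer against a host clock on each device a sane way to merge the per-GPU logs into one time-ordered trace?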
My questions:
- Is there any official or community-supported tool that can provide a complete, time-ordered trace of all memory accesses (not just copies or allocations) on each GPU in a multi-GPU setup?
- Are there any best practices or references for this kind of fine-grained, multi-GPU memory access tracing?
Any advice or experience on this topic would be greatly appreciated. Thank you!