Hi guys
I would like to trace every memory access (load/store) made by CUDA kernels on each device in a multi-GPU system, and obtain a time-ordered log of these accesses for all devices.
From my understanding:
- The CUPTI Activity API traces memory copies, allocations, and unified memory events, but it does not provide a way to record every individual memory access inside a kernel (the first sketch after this list shows what I am currently able to capture with it).
- The CUPTI Metric/Event APIs can give aggregate statistics (e.g., the total number of loads/stores), but not a full trace of each access.
- NVBit allows dynamic instrumentation at the instruction level, so logging every memory access seems possible. However, it does not natively record a device ID or a global time order across multiple GPUs, so extra work is needed to correlate and order the logs from different devices (honestly, I don't fully understand how NVBit works; the second sketch after this list is my rough mental model of it).
- Nsight Systems and Nsight Compute provide detailed profiling and some memory access statistics, but they do not offer a full, time-ordered trace of all memory accesses per device.
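For reference, this is roughly what I have working with the CUPTI Activity API today. It is a minimal sketch (error handling omitted; the exact memcpy record struct version, e.g. CUpti_ActivityMemcpy vs. a numbered variant, depends on the CUPTI release): it gives per-device, timestamped memcpy and kernel records, but nothing at the level of individual loads/stores inside a kernel.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

// CUPTI asks us for empty buffers and hands them back filled with records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
    *size = 16 * 1024;
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;  // 0 = fill the buffer with as many records as fit
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
            // deviceId / start / end / bytes are present in every struct version.
            auto *m = (CUpti_ActivityMemcpy *)record;
            printf("memcpy dev=%u start=%llu ns end=%llu ns bytes=%llu\n",
                   m->deviceId, (unsigned long long)m->start,
                   (unsigned long long)m->end, (unsigned long long)m->bytes);
        }
    }
    free(buffer);
}

int main() {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // ... run the multi-GPU workload here ...

    cuptiActivityFlushAll(0);
    return 0;
}
```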
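And here is my rough mental model of an NVBit tool, loosely based on the mem_trace example that ships with the NVBit SDK. It assumes the NVBit headers and build setup, skips the channel/export boilerplate of the real example (the injected printf is only for illustration; mem_trace streams records to a host thread instead), and exact API names can differ slightly between NVBit releases. The parts I added myself for the multi-GPU question, and am unsure about, are the dev_id argument captured at launch time and the %globaltimer read on the device, which would still have to be calibrated against a host clock to get a trustworthy cross-GPU ordering.

```cpp
// inject_funcs.cu -- device function injected before every memory instruction.
#include <cstdint>
#include <cstdio>

extern "C" __device__ __noinline__ void instrument_mem(int pred, uint64_t addr,
                                                       int dev_id) {
    if (!pred) return;                  // the instruction was predicated off
    uint64_t t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));  // per-GPU ns timer
    // Illustration only: mem_trace pushes a record through a channel instead.
    printf("dev=%d addr=0x%llx t=%llu\n", dev_id,
           (unsigned long long)addr, (unsigned long long)t);
}
```

```cpp
// mem_trace.cu -- host side: instrument every load/store of each launched kernel.
#include <set>
#include <vector>
#include "nvbit.h"
#include "nvbit_tool.h"

static std::set<CUfunction> already_instrumented;

static void instrument_function(CUcontext ctx, CUfunction func, int dev_id) {
    std::vector<CUfunction> funcs = nvbit_get_related_functions(ctx, func);
    funcs.push_back(func);
    for (CUfunction f : funcs) {
        if (!already_instrumented.insert(f).second) continue;  // done already
        for (Instr *instr : nvbit_get_instrs(ctx, f)) {
            if (instr->getMemorySpace() == InstrType::MemorySpace::NONE ||
                instr->getMemorySpace() == InstrType::MemorySpace::CONSTANT)
                continue;
            nvbit_insert_call(instr, "instrument_mem", IPOINT_BEFORE);
            nvbit_add_call_arg_guard_pred_val(instr);        // predicate
            nvbit_add_call_arg_mref_addr64(instr, 0);        // effective address
            nvbit_add_call_arg_const_val32(instr, dev_id);   // device of launch
        }
    }
}

void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus) {
    if ((cbid == API_CUDA_cuLaunchKernel ||
         cbid == API_CUDA_cuLaunchKernel_ptsz) && !is_exit) {
        auto *p = (cuLaunchKernel_params *)params;
        CUdevice dev;
        cuCtxGetDevice(&dev);  // GPU that owns the context of this launch
        instrument_function(ctx, p->f, (int)dev);
        nvbit_enable_instrumented(ctx, p->f, true);
    }
}
```

Is this the right way to think about it, and is calibrating %globaltimer against a host clock on each device a sane way to merge the per-GPU logs into one time-ordered trace?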
My questions:
- Is there any official or community-supported tool that can provide a complete, time-ordered trace of all memory accesses (not just copies or allocations) on each GPU in a multi-GPU setup?
- Are there any best practices or references for this kind of fine-grained, multi-GPU memory access tracing?
Any advice or experience on this topic would be greatly appreciated. Thank you!