Is there a `perf mem` equivalent for CUDA monitoring?

I want to monitor software overhead.
For this purpose, I am looking for a tool equivalent to perf mem.
According to the Nsight Compute Kernel Profiling Guide,
the memory access behavior of a whole CUDA kernel can be captured.
But it cannot show the addresses accessed by individual functions.

Reference

You can use the compute sanitizer API for this purpose. A memory tracker example can be found here: compute-sanitizer-samples/MemoryTracker at master · NVIDIA/compute-sanitizer-samples · GitHub

Thank you for your suggestion. It seems to be a library.
Is there any documentation for the library API, or sample code that uses the MemoryTracker library?

Both the API documentation and the sample code are linked in my previous reply already.

Let me confirm: is MemoryTracker available only as source code?
I ask because the compute-sanitizer command supports only Memcheck, Racecheck, Initcheck, and Synccheck.

Reference

1.4. Compute Sanitizer Tools

MemoryTracker is an example code using the compute sanitizer patching API. The example is available only as source code. We do not provide a pre-built binary for it.

When I use LD_PRELOAD with the following command, I get the output below.
Is any additional information available?

$ LD_PRELOAD=~/compute-sanitizer-samples/MemoryTracker/libMemoryTracker.so:/usr/local/cuda-11.3/compute-sanitizer/libsanitizer-public.so ./a.out
Kernel Launch: _Z8kernel_APdii
  Memory accesses: 0
Kernel Launch: _Z8kernel_CPdPKdi
  Memory accesses: 0

The run above used the following code.

Which GPU are you using? I suspect you are running code on an SM architecture that is not explicitly supported by the sample build configuration. Locally, I have:

$ LD_PRELOAD=./libMemoryTracker.so LD_LIBRARY_PATH=$SANITIZER_LIBRARIES ./test | grep -B1 -A5 'Memory accesses'
Kernel Launch: _Z8kernel_APdii
  Memory accesses: 1024
  [0] Read access of global memory by thread (32,0,0) at address 0x7f7828015b00 (size is 8 bytes)
  [1] Read access of global memory by thread (32,0,0) at address 0x7f7828010100 (size is 8 bytes)
  [2] Read access of global memory by thread (0,0,0) at address 0x7f782800c800 (size is 8 bytes)
  [3] Read access of global memory by thread (32,0,0) at address 0x7f7828014300 (size is 8 bytes)
  [4] Read access of global memory by thread (32,0,0) at address 0x7f7828005f00 (size is 8 bytes)
--
Kernel Launch: _Z8kernel_BPdii
  Memory accesses: 1024
  [0] Read access of global memory by thread (0,0,0) at address 0x7f7828003e00 (size is 8 bytes)
  [1] Read access of global memory by thread (0,0,0) at address 0x7f7828005600 (size is 8 bytes)
  [2] Read access of global memory by thread (0,0,0) at address 0x7f7828001000 (size is 8 bytes)
  [3] Read access of global memory by thread (0,0,0) at address 0x7f7828005400 (size is 8 bytes)
  [4] Read access of global memory by thread (0,0,0) at address 0x7f7828002800 (size is 8 bytes)
--
Kernel Launch: _Z8kernel_CPdPKdi
  Memory accesses: 1024
  [0] Read access of global memory by thread (32,0,0) at address 0x7f782000b028 (size is 8 bytes)
  [1] Read access of global memory by thread (32,0,0) at address 0x7f7820015030 (size is 8 bytes)
  [2] Read access of global memory by thread (33,0,0) at address 0x7f782000b0a8 (size is 8 bytes)
  [3] Read access of global memory by thread (33,0,0) at address 0x7f78200150b0 (size is 8 bytes)
  [4] Read access of global memory by thread (34,0,0) at address 0x7f782000b128 (size is 8 bytes)

We will consider adding more GPU architectures to the build in our samples. In the meantime, feel free to manually add your SM architecture to MemoryTracker/Makefile. If you have additional questions, please let me know!

Thank you for your comments.

I am using a GeForce RTX 2070 (CC 7.5) with CUDA 11.3 and a Tesla A100 (CC 8.0) with CUDA 11.4. The output stays the same, even when I change the SMS parameter to 80.
Is there any additional option needed to compile MemoryTracker?

$ make
g++ -I/usr/local/cuda/include -I/usr/local/cuda/compute-sanitizer/include -L/usr/local/cuda/compute-sanitizer -fPIC -shared -o libMemoryTracker.so MemoryTracker.cpp -lsanitizer-public
$ LD_PRELOAD=~/sakaia/compute-sanitizer-samples/MemoryTracker/libMemoryTracker.so:/usr/local/cuda-11.4/compute-sanitizer/libsanitizer-public.so ./a.out
Kernel Launch: _Z8kernel_APdii
  Memory accesses: 0
Kernel Launch: _Z8kernel_BPdii
  Memory accesses: 0
Kernel Launch: _Z8kernel_CPdPKdi
  Memory accesses: 0

I forgot to mention that you need to run cd ~/sakaia/compute-sanitizer-samples/MemoryTracker/ first, since your current directory must contain MemoryTrackerPatches.fatbin (cf. MemoryTracker.cpp:80). Because this program is just a sample, it has no error handling, so a failure like this one does not produce an error message. Please let me know if that works!
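Because the sample looks for the patch file in the current directory and stays silent when it is missing, a small launcher can check for the file before starting the application. The following is a minimal Python sketch and not part of the sample: the helper names `check_patch_file` and `run_tracked` are hypothetical, and the library paths are whatever your installation uses.

```python
import os
import subprocess
from pathlib import Path

# File name taken from the reply above; the sample opens it relative to the
# current working directory.
PATCH_FILE = "MemoryTrackerPatches.fatbin"


def check_patch_file(workdir):
    """Return the patch file path, or raise if it is missing from workdir."""
    path = Path(workdir) / PATCH_FILE
    if not path.is_file():
        raise FileNotFoundError(
            f"{PATCH_FILE} not found in {workdir}; "
            "the tracker would report 0 memory accesses"
        )
    return path


def run_tracked(app, tracker_lib, sanitizer_lib, workdir="."):
    """Run app from workdir with both libraries preloaded (hypothetical helper)."""
    check_patch_file(workdir)
    env = dict(os.environ)
    # Same LD_PRELOAD layout as the commands in this thread.
    env["LD_PRELOAD"] = f"{tracker_lib}:{sanitizer_lib}"
    return subprocess.run(
        [app], cwd=workdir, env=env, capture_output=True, text=True
    )
```

Passing an absolute path for `app` avoids surprises, since the child process starts in `workdir`. The tracker output is then available in the `stdout` attribute of the returned object.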

Thank you, it works fine! I attach the first 10 lines.

$ cp compute-sanitizer-samples/MemoryTracker/MemoryTrackerPatches.fatbin .
$ LD_PRELOAD=~/sakaia/compute-sanitizer-samples/MemoryTracker/libMemoryTracker.so:/usr/local/cuda-11.4/compute-sanitizer/libsanitizer-public.so ./a.out | head
Kernel Launch: _Z8kernel_APdii
  Memory accesses: 1024
  [0] Read access of global memory by thread (32,0,0) at address 0x7f6362003100 (size is 8 bytes)
  [1] Read access of global memory by thread (33,0,0) at address 0x7f6362003108 (size is 8 bytes)
  [2] Read access of global memory by thread (0,0,0) at address 0x7f6362001600 (size is 8 bytes)
  [3] Read access of global memory by thread (32,0,0) at address 0x7f636200bd00 (size is 8 bytes)
  [4] Read access of global memory by thread (1,0,0) at address 0x7f6362001608 (size is 8 bytes)
  [5] Read access of global memory by thread (34,0,0) at address 0x7f6362003110 (size is 8 bytes)
  [6] Read access of global memory by thread (33,0,0) at address 0x7f636200bd08 (size is 8 bytes)
  [7] Read access of global memory by thread (2,0,0) at address 0x7f6362001610 (size is 8 bytes)
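To get closer to a perf mem-style summary, the per-access lines can be aggregated per kernel. The following is a minimal Python sketch that assumes the exact line format shown in the output above; the function name `summarize` and the choice of statistics (access count, bytes, lowest and highest address) are illustrative assumptions, not something the sample provides.

```python
import re
from collections import defaultdict

KERNEL_RE = re.compile(r"^Kernel Launch: (\S+)")
ACCESS_RE = re.compile(
    r"\[(\d+)\] (\w+) access of (\w+) memory by thread "
    r"\((\d+),(\d+),(\d+)\) at address (0x[0-9a-fA-F]+) "
    r"\(size is (\d+) bytes\)"
)


def summarize(lines):
    """Aggregate logged accesses per kernel name.

    Launches of the same kernel are merged. A kernel with no logged
    accesses still gets an entry, with count 0.
    """
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "lo": None, "hi": None})
    kernel = None
    for line in lines:
        launch = KERNEL_RE.match(line.strip())
        if launch:
            kernel = launch.group(1)
            stats[kernel]  # register the kernel even if it logs no accesses
            continue
        access = ACCESS_RE.search(line)
        if access and kernel is not None:
            addr = int(access.group(7), 16)
            entry = stats[kernel]
            entry["count"] += 1
            entry["bytes"] += int(access.group(8))
            entry["lo"] = addr if entry["lo"] is None else min(entry["lo"], addr)
            entry["hi"] = addr if entry["hi"] is None else max(entry["hi"], addr)
    return dict(stats)


# Three lines copied from the output above, used as sample input.
SAMPLE = [
    "Kernel Launch: _Z8kernel_APdii",
    "  Memory accesses: 1024",
    "  [0] Read access of global memory by thread (32,0,0) at address 0x7f6362003100 (size is 8 bytes)",
    "  [1] Read access of global memory by thread (33,0,0) at address 0x7f6362003108 (size is 8 bytes)",
    "  [2] Read access of global memory by thread (0,0,0) at address 0x7f6362001600 (size is 8 bytes)",
]

for name, entry in summarize(SAMPLE).items():
    print(name, entry["count"], entry["bytes"], hex(entry["lo"]), hex(entry["hi"]))
```

An entry with a count of 0 is a quick hint that the patch file was not loaded, as in the earlier runs in this thread.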