I want to measure software overhead.
For this purpose, I am looking for a GPU equivalent of perf mem.
According to the Nsight Compute Kernel Profiling Guide,
Nsight Compute can capture the memory access behavior of a CUDA kernel as a whole,
but it cannot show which addresses individual functions access.
Can you confirm that MemoryTracker is available only as source code?
I ask because the compute-sanitizer command only supports Memcheck, Racecheck, Initcheck and Synccheck.
MemoryTracker is an example code using the compute sanitizer patching API. The example is available only as source code. We do not provide a pre-built binary for it.
Which GPU are you using? I suspect you are running code on an SM architecture that is not explicitly supported by the sample build configuration. Locally, I have:
$ LD_PRELOAD=./libMemoryTracker.so LD_LIBRARY_PATH=$SANITIZER_LIBRARIES ./test | grep -B1 -A5 'Memory accesses'
Kernel Launch: _Z8kernel_APdii
Memory accesses: 1024
[0] Read access of global memory by thread (32,0,0) at address 0x7f7828015b00 (size is 8 bytes)
[1] Read access of global memory by thread (32,0,0) at address 0x7f7828010100 (size is 8 bytes)
[2] Read access of global memory by thread (0,0,0) at address 0x7f782800c800 (size is 8 bytes)
[3] Read access of global memory by thread (32,0,0) at address 0x7f7828014300 (size is 8 bytes)
[4] Read access of global memory by thread (32,0,0) at address 0x7f7828005f00 (size is 8 bytes)
--
Kernel Launch: _Z8kernel_BPdii
Memory accesses: 1024
[0] Read access of global memory by thread (0,0,0) at address 0x7f7828003e00 (size is 8 bytes)
[1] Read access of global memory by thread (0,0,0) at address 0x7f7828005600 (size is 8 bytes)
[2] Read access of global memory by thread (0,0,0) at address 0x7f7828001000 (size is 8 bytes)
[3] Read access of global memory by thread (0,0,0) at address 0x7f7828005400 (size is 8 bytes)
[4] Read access of global memory by thread (0,0,0) at address 0x7f7828002800 (size is 8 bytes)
--
Kernel Launch: _Z8kernel_CPdPKdi
Memory accesses: 1024
[0] Read access of global memory by thread (32,0,0) at address 0x7f782000b028 (size is 8 bytes)
[1] Read access of global memory by thread (32,0,0) at address 0x7f7820015030 (size is 8 bytes)
[2] Read access of global memory by thread (33,0,0) at address 0x7f782000b0a8 (size is 8 bytes)
[3] Read access of global memory by thread (33,0,0) at address 0x7f78200150b0 (size is 8 bytes)
[4] Read access of global memory by thread (34,0,0) at address 0x7f782000b128 (size is 8 bytes)
We will consider adding more GPU architectures to the build in our samples. In the meantime, feel free to manually add your SM architecture to MemoryTracker/Makefile. If you have additional questions, please let me know!
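If you want to post-process this output rather than read it by eye, here is a minimal Python sketch. It assumes the log lines have exactly the format shown above; the function name `summarize` and the summary fields are my own choices for illustration, not part of the sample. It only counts the records that actually appear in the text you feed it, so after `grep -A5` or `head` the counts will be smaller than the "Memory accesses" total.

```python
import re

# Patterns assume the line format printed in this thread.
KERNEL_RE = re.compile(r"^Kernel Launch: (\S+)")
ACCESS_RE = re.compile(
    r"^\[(\d+)\] (\w+) access of (\w+) memory by thread "
    r"\((\d+),(\d+),(\d+)\) at address (0x[0-9a-fA-F]+) "
    r"\(size is (\d+) bytes\)"
)


def summarize(lines):
    """Group the printed access records by kernel name (hypothetical helper)."""
    summary = {}
    current = None
    for raw in lines:
        line = raw.strip()
        kernel = KERNEL_RE.match(line)
        if kernel:
            current = summary.setdefault(
                kernel.group(1),
                {"records": 0, "bytes": 0, "min_addr": None, "max_addr": None},
            )
            continue
        access = ACCESS_RE.match(line)
        if access and current is not None:
            addr = int(access.group(7), 16)
            size = int(access.group(8))
            current["records"] += 1
            current["bytes"] += size
            if current["min_addr"] is None or addr < current["min_addr"]:
                current["min_addr"] = addr
            if current["max_addr"] is None or addr > current["max_addr"]:
                current["max_addr"] = addr
    return summary


# A few lines copied from the output above.
sample = """\
Kernel Launch: _Z8kernel_APdii
Memory accesses: 1024
[0] Read access of global memory by thread (32,0,0) at address 0x7f7828015b00 (size is 8 bytes)
[1] Read access of global memory by thread (32,0,0) at address 0x7f7828010100 (size is 8 bytes)
--
Kernel Launch: _Z8kernel_BPdii
Memory accesses: 1024
[0] Read access of global memory by thread (0,0,0) at address 0x7f7828003e00 (size is 8 bytes)
"""

summary = summarize(sample.splitlines())
for name, stats in summary.items():
    print(name, stats["records"], stats["bytes"],
          hex(stats["min_addr"]), hex(stats["max_addr"]))
```

To use it on a real run, you could pipe the program's output into a file and pass the file's lines to `summarize` instead of `sample`.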
I am using a GeForce RTX 2070 (CC 7.5) with CUDA 11.3 and a Tesla A100 (CC 8.0) with CUDA 11.4. The output stays the same, even when I change the SMS parameter to 80.
Is there any additional option needed to compile MemoryTracker?
I forgot to mention that you need to run cd ~/sakaia/compute-sanitizer-samples/MemoryTracker/ first, because your current directory must contain MemoryTrackerPatches.fatbin (cf. MemoryTracker.cpp:80). Since this program is just a sample, it has no error handling, so a problem like this does not produce an error message. Please let me know if that works!
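Because the sample fails silently when the fatbin is missing, it can help to check for it before launching. Here is a rough Python sketch of such a wrapper. The helper names `prepare_launch` and `run_with_tracker` are hypothetical, and preloading both libraries through `LD_PRELOAD` is an assumption about your setup; adjust the paths to match your installation.

```python
import os
import subprocess
from pathlib import Path

FATBIN_NAME = "MemoryTrackerPatches.fatbin"


def prepare_launch(app, tracker_dir, sanitizer_lib):
    """Return (argv, env, cwd) for a tracked run, or raise if the fatbin is missing.

    Hypothetical helper: it only checks the files and builds the environment.
    """
    tracker_dir = Path(tracker_dir).resolve()
    fatbin = tracker_dir / FATBIN_NAME
    if not fatbin.is_file():
        raise FileNotFoundError(
            f"{fatbin} not found; the sample looks for it in the current directory"
        )
    tracker_lib = tracker_dir / "libMemoryTracker.so"
    env = dict(os.environ)
    # Assumption: both libraries are preloaded, separated by a colon.
    env["LD_PRELOAD"] = f"{tracker_lib}:{sanitizer_lib}"
    # Resolve the application path now, because the run uses a different cwd.
    argv = [str(Path(app).resolve())]
    return argv, env, str(tracker_dir)


def run_with_tracker(app, tracker_dir, sanitizer_lib):
    """Run the application from the MemoryTracker directory and capture its output."""
    argv, env, cwd = prepare_launch(app, tracker_dir, sanitizer_lib)
    return subprocess.run(argv, env=env, cwd=cwd, capture_output=True, text=True)
```

Note that this changes the working directory of your application, which matters if it opens files by relative path. In that case, copying MemoryTrackerPatches.fatbin into the directory you run from is the simpler option.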
$ cp compute-sanitizer-samples/MemoryTracker/MemoryTrackerPatches.fatbin .
$ LD_PRELOAD=~/sakaia/compute-sanitizer-samples/MemoryTracker/libMemoryTracker.so:/usr/local/cuda-11.4/compute-sanitizer/libsanitizer-public.so ./a.out | head
Kernel Launch: _Z8kernel_APdii
Memory accesses: 1024
[0] Read access of global memory by thread (32,0,0) at address 0x7f6362003100 (size is 8 bytes)
[1] Read access of global memory by thread (33,0,0) at address 0x7f6362003108 (size is 8 bytes)
[2] Read access of global memory by thread (0,0,0) at address 0x7f6362001600 (size is 8 bytes)
[3] Read access of global memory by thread (32,0,0) at address 0x7f636200bd00 (size is 8 bytes)
[4] Read access of global memory by thread (1,0,0) at address 0x7f6362001608 (size is 8 bytes)
[5] Read access of global memory by thread (34,0,0) at address 0x7f6362003110 (size is 8 bytes)
[6] Read access of global memory by thread (33,0,0) at address 0x7f636200bd08 (size is 8 bytes)
[7] Read access of global memory by thread (2,0,0) at address 0x7f6362001610 (size is 8 bytes)