NVPROF - How Does It Work?

Hi all, the title says it all. I hope this is a reasonable place to ask. As we all know, nvprof is a great tool that takes a CUDA binary and its program arguments, and somehow manages to collect profiling counters on behalf of that program, from the invoking process.

I want to build my own nvprof-like tool that tracks a set of CUPTI events from a binary. This is part of my research project on estimating power on Tegra systems: I have power models, and I need to collect various inputs, including (but not limited to) CUPTI event counts, to estimate power usage at run time.

I have been sitting here for quite a few hours trying to wrap my head around it. nvprof appears to use libdl and libpthread. I am also finding calls to execp (presumably execvp or execlp), and reports of nvprof failing because dlopen cannot open libcupti… To me this suggests some clever trick: creating a thread and somehow executing the binary in that thread, while the main thread holds the CUPTI library data and callback routines. But in principle this should be impossible without position-independent code in the profiled binary, as you cannot run two executables in the same process… It would mess up libc initialisation, among other things, and it seems really illogical. So how is it done? Am I on the right track?

Or could it be some unusual mmap() sharing of libcupti's data between two processes?
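
To make one more alternative concrete: the exec calls and the dlopen of libcupti would also fit a much plainer design, where the tool starts the target as an ordinary child process and relies on the dynamic linker to load a helper library into it. I have not confirmed that nvprof does this. The sketch below (Python, for brevity) shows only the launcher half, using the standard Linux LD_PRELOAD mechanism. The names run_with_env, launch_with_preload and libmyprofiler.so are made up for illustration, and the helper library that would actually call CUPTI is assumed, not shown.

```python
import os
import subprocess


def run_with_env(cmd, extra_env):
    """Run cmd as a child process with extra_env merged into its environment.

    Returns the child's exit code. The launcher's own environment is not
    modified, because the variables are applied to a copy.
    """
    env = dict(os.environ)
    env.update(extra_env)
    return subprocess.run(cmd, env=env).returncode


def launch_with_preload(preload_lib, cmd):
    """Start cmd with LD_PRELOAD pointing at preload_lib.

    On Linux the dynamic linker loads preload_lib into the child before the
    program's own libraries. preload_lib is a hypothetical shared object
    (for example libmyprofiler.so) that would set up CUPTI from inside the
    target process; whether nvprof works this way is an assumption.
    """
    return run_with_env(cmd, {"LD_PRELOAD": preload_lib})
```

Under those assumptions, something like launch_with_preload("/path/to/libmyprofiler.so", ["./my_cuda_app", "arg1"]) would run the unmodified binary in its own process, so there would be no need to run two executables in one address space.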

Puzzled