CUPTI activity API and child processes

You can develop a profiling tool like nvprof by writing a CUPTI-based shared library. The library needs to enable the appropriate CUPTI activities with cuptiActivityEnable() to collect tracing information, or call the event and metric APIs to profile GPU performance characteristics. For more control over the profiling session, you can use the CUPTI Callback API to register a callback in your code. The callback is invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver. Refer to the CUPTI samples callback_event and callback_metric for usage of the CUPTI event and metric APIs, respectively.

To profile both the main process and its child processes, you can inject the shared library into the target application, e.g. using LD_PRELOAD (Linux) or Detours (Windows), or by modifying the target application itself if applicable. From that library, you can initialize CUPTI for the whole target process. Refer to the links below for more information:
ld.so(8) - Linux manual page (for LD_PRELOAD)
https://github.com/Microsoft/Detours (for Detours)
https://docs.nvidia.com/cuda/cupti/index.html#r_initialization (for CUPTI initialization)
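
The injection approach described above can be sketched roughly as follows for Linux with GCC or Clang. This is only a minimal illustration, not a complete profiler. The file name inject.c, the macro USE_CUPTI, the helper inject_init_count() and the buffer size are all made up for this example. The CUPTI calls are compiled only when USE_CUPTI is defined, so the skeleton also builds on a machine without the CUDA toolkit.

```c
/* inject.c (hypothetical name): sketch of a preloadable profiling library.
 *
 * Plain build:  gcc -shared -fPIC inject.c -o libinject.so
 * CUPTI build:  add -DUSE_CUPTI, the CUPTI include and library paths, -lcupti
 * Usage:        LD_PRELOAD=./libinject.so ./target_app
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifdef USE_CUPTI
#include <stdint.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024) /* illustrative size, not a recommendation */

/* CUPTI calls this when it needs an empty buffer for activity records. */
static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_num_records)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    if (*buffer == NULL) {
        abort();
    }
    *size = BUF_SIZE;
    *max_num_records = 0; /* 0: store as many records as fit in the buffer */
}

/* CUPTI calls this with a filled buffer; walk the records, then free it. */
static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size,
                                      size_t valid_size)
{
    CUpti_Activity *record = NULL;
    (void)ctx; (void)stream_id; (void)size;
    while (cuptiActivityGetNextRecord(buffer, valid_size, &record)
           == CUPTI_SUCCESS) {
        fprintf(stderr, "[inject pid %d] activity kind %d\n",
                (int)getpid(), (int)record->kind);
    }
    free(buffer);
}

/* Ask CUPTI to deliver any records that are still buffered. */
static void flush_activity(void)
{
    cuptiActivityFlushAll(0);
}
#endif /* USE_CUPTI */

/* Number of times the constructor ran in this process. */
static int g_init_count = 0;

/* Hypothetical accessor, only here so the sketch can be checked. */
int inject_init_count(void)
{
    return g_init_count;
}

/* Runs when the library is loaded, i.e. before main() of the target. */
__attribute__((constructor))
static void inject_init(void)
{
    g_init_count++;
    fprintf(stderr, "[inject] loaded into pid %d\n", (int)getpid());
#ifdef USE_CUPTI
    if (cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed)
            == CUPTI_SUCCESS &&
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
            == CUPTI_SUCCESS) {
        atexit(flush_activity); /* flush remaining records at process exit */
    }
#endif
}
```

Because LD_PRELOAD is an environment variable, child processes that the target starts via exec inherit it unless the target clears its environment, so the constructor should run again in each such child and every process sets up its own CUPTI session. A child created by fork alone does not re-run the constructor. Flushing from an atexit handler is a simplification for this sketch. A real tool may need to flush at a more controlled point.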