CUPTI activity API and child processes

Hello,

I am trying to develop an application that activates CUPTI’s Activity API in the main thread and then creates a child thread in which CUDA code runs. The child thread is an external application invoked through Boost’s system call. The CUPTI Activity API is not picking up any GPU activity in the child process - is there a way to configure CUPTI to pick up GPU activity from the child thread?

Thank you!

Activating CUPTI’s activity APIs in the main thread enables profiling of the main thread as well as all of its child threads. If the main thread creates a new process (for example, using fork), however, profiling won’t happen for the new process. Does the latter apply to your use case?

To pick up GPU activity in the new process, you can either call the CUPTI APIs from the application code itself or inject a CUPTI-based profiling library into the application process.

Is there any sample code that shows how to use the CUPTI API to profile a new process (created using fork)?

Unfortunately, NVIDIA does not provide any example code for child process profiling with CUPTI.

Does nvprof use CUPTI? I found that nvprof’s log file format uses CUPTI, but I don’t think it uses CUPTI directly.

Correct, nvprof uses CUPTI under the hood. But CUPTI itself doesn’t provide support for child process profiling. It’s the responsibility of the CUPTI client to implement this support.

Thanks for your quick update. Based on that information, I am curious how to use CUPTI to profile GPU events for a CUDA executable. Say, when I use nvprof to profile a CUDA executable as below, I get event information for both kernels. I know how to implement this with a CUPTI client, but I don’t know how to implement a tool that uses CUPTI and can get GPU events for a CUDA executable…

nvprof -e fb_subp0_read_sectors ./concurrentKernels
[./concurrentKernels] - Starting...
==8361== NVPROF is profiling process 8361, command: ./concurrentKernels
GPU Device 0: "GeForce GTX 1050" with compute capability 6.1

Detected Compute SM 6.1 hardware with 5 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.113s
Test passed
==8361== Profiling application: ./concurrentKernels
==8361== Profiling result:
==8361== Event result:
Invocations                Event Name         Min         Max         Avg       Total
Device "GeForce GTX 1050 (0)"
    Kernel: sum(long*, int)
          1     fb_subp0_read_sectors         100         100         100         100
    Kernel: clock_block(long*, long)
          8     fb_subp0_read_sectors      112177      179004      153147     1225177

You can develop a profiling tool like nvprof by writing a CUPTI-based shared library. This library needs to enable the appropriate CUPTI activities using cuptiActivityEnable() to collect tracing information, or call the event/metric APIs to profile GPU performance characteristics. For more control over the profiling session, you can use the CUPTI Callback API to register a callback in your code. Your callback will be invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver. Refer to the CUPTI samples callback_event and callback_metric for the usage of the CUPTI event and metric APIs, respectively.
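As a rough illustration of the shared-library approach described above, here is a minimal sketch of a tracing library built on the CUPTI Activity API. The file name, the HAVE_CUPTI macro, the injection_loaded flag, and the idea of starting the trace from a GCC/Clang library constructor are all assumptions made for this sketch, not something NVIDIA ships. Printing kernel names is left out because the kernel record struct is versioned across CUPTI releases.

```c
/* cupti_inject.c: hypothetical tracing library (sketch, not an NVIDIA sample).
 * Assumed build line, paths depend on your CUDA install:
 *   gcc -shared -fPIC -DHAVE_CUPTI cupti_inject.c \
 *       -I$CUDA_HOME/extras/CUPTI/include -L$CUDA_HOME/extras/CUPTI/lib64 \
 *       -lcupti -o libcupti_inject.so
 * Without -DHAVE_CUPTI the file compiles as a dry run that only records
 * that the library was loaded. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int injection_loaded = 0; /* hypothetical flag: set once the constructor ran */

#ifdef HAVE_CUPTI
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

/* CUPTI calls this when it needs an empty buffer for activity records. */
static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_records)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE); /* malloc alignment is >= 8 on 64-bit */
    if (*buffer == NULL)
        abort();
    *size = BUF_SIZE;
    *max_records = 0; /* 0 = fill the buffer with as many records as fit */
}

/* CUPTI calls this when a buffer is full or has been flushed. */
static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size,
                                      size_t valid_size)
{
    CUpti_Activity *record = NULL;
    (void)ctx; (void)stream_id; (void)size;

    while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL ||
            record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL)
            fprintf(stderr, "[pid %d] kernel activity record\n", (int)getpid());
    }
    free(buffer);
}

/* Deliver any records still buffered when the process exits. */
static void flush_at_exit(void)
{
    cuptiActivityFlushAll(0);
}

static void start_tracing(void)
{
    if (cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed) != CUPTI_SUCCESS ||
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) != CUPTI_SUCCESS) {
        fprintf(stderr, "[pid %d] CUPTI setup failed\n", (int)getpid());
        return;
    }
    atexit(flush_at_exit);
}
#else
static void start_tracing(void) { /* dry run: CUPTI not compiled in */ }
#endif

/* Runs when the library is loaded into a process, before main(). */
__attribute__((constructor))
static void on_load(void)
{
    start_tracing();
    injection_loaded = 1;
}
```

Because the setup runs inside whichever process loads the library, each process gets its own CUPTI session and reports under its own pid, which is what child process profiling needs.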

To profile both the main and child processes, you can inject the shared library into the target application, e.g. using LD_PRELOAD (Linux) or Detours (Windows), or by modifying the target application itself if applicable. From that library, you can initialize CUPTI for the whole target process. Refer to the links below for more information:
ld.so(8) - Linux manual page (for LD_PRELOAD)
https://github.com/Microsoft/Detours (for Detours)
https://docs.nvidia.com/cuda/cupti/index.html#r_initialization (for CUPTI initialization)
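
To make the LD_PRELOAD route concrete, the following sketch shows one way a parent program might launch a child with a profiling library preloaded. The function name run_with_preload and the library path in the usage note are hypothetical. The sketch only covers Linux-style process creation with fork and execvp and says nothing about how the preloaded library itself is written.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] (looked up on PATH) as a child process with LD_PRELOAD set
 * to lib_path. Returns the child's exit status, or -1 if the child could
 * not be created or did not exit normally. */
int run_with_preload(const char *lib_path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: set the variable here so the parent's environment stays
         * untouched, then replace the process image with the target. */
        if (setenv("LD_PRELOAD", lib_path, 1) != 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call might look like run_with_preload("/path/to/libcupti_inject.so", argv) with argv = {"./concurrentKernels", NULL}. If the child is started with a plain system call instead, the child inherits the parent's environment, so setting LD_PRELOAD in the parent before the call should have a similar effect.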

For the linker to work and to read events, it needs the context handle, so how can I get the context handle to provide to the linker?