I am trying to develop an application that activates CUPTI’s Activity API in the main thread and then creates a child thread in which CUDA code runs. The child thread is an external application invoked through Boost’s system call. The CUPTI Activity API is not picking up any GPU activity in the child process. Is there a way to configure CUPTI to pick up GPU activity from the child thread?
Activating CUPTI’s Activity API in the main thread enables profiling of the main thread and all of its child threads. If the main thread creates a new process (for example, using fork), the new process is not profiled. Does the latter apply to your use case?
To pick up GPU activity in the new process, you can either call the CUPTI APIs from the application code itself or inject a CUPTI-based profiling library into the application process.
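As a rough sketch of the injection route (not a definitive recipe): a small launcher can start the external application as a child process with an environment variable that asks the CUDA driver to load a profiling library. This assumes the `CUDA_INJECTION64_PATH` variable is honored by your driver and toolkit version and that the library exports the entry point the driver expects, so please check the CUPTI documentation for your release. The function name `launch_with_injection` and any library path you pass to it are hypothetical and only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv as a child process and request that a
 * CUPTI-based library (lib_path) be injected into it.
 * Assumption: the CUDA driver in the child reads CUDA_INJECTION64_PATH.
 * Returns the child's exit code, or -1 on failure. */
int launch_with_injection(const char *lib_path, char *const argv[])
{
    assert(lib_path != NULL && argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: set the variable here only, so the launcher's own
         * environment stays untouched. */
        if (setenv("CUDA_INJECTION64_PATH", lib_path, 1) != 0) {
            _exit(127);
        }
        execvp(argv[0], argv);
        _exit(127); /* reached only if exec failed */
    }

    /* Parent: wait for the profiled application to finish. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would build an argv array such as `{"./concurrentKernels", NULL}` and pass it together with the path to its own CUPTI-based shared library. Setting the variable only in the child keeps the launcher process itself from loading the library.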
Correct, nvprof uses CUPTI under the hood. However, CUPTI itself doesn’t provide support for child-process profiling; it is the responsibility of the CUPTI client to implement it.
Thanks for the quick update. Based on that information, I am curious how to use CUPTI to profile GPU events for a CUDA executable. For example, when I profile a CUDA executable with nvprof as shown below, I get event information for both kernels. I know how to implement this with a CUPTI client, but I don’t know how to implement a tool that uses CUPTI and can collect GPU events for a CUDA executable…
nvprof -e fb_subp0_read_sectors ./concurrentKernels
[./concurrentKernels] - Starting...
==8361== NVPROF is profiling process 8361, command: ./concurrentKernels
GPU Device 0: "GeForce GTX 1050" with compute capability 6.1
Detected Compute SM 6.1 hardware with 5 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.113s
Test passed
==8361== Profiling application: ./concurrentKernels
==8361== Profiling result:
==8361== Event result:
Invocations               Event Name         Min         Max         Avg       Total
Device "GeForce GTX 1050 (0)"
    Kernel: sum(long*, int)
          1    fb_subp0_read_sectors         100         100         100         100
    Kernel: clock_block(long*, long)
          8    fb_subp0_read_sectors      112177      179004      153147     1225177
You can develop a profiling tool like nvprof by writing a CUPTI-based shared library. This library needs to enable the appropriate CUPTI activities using cuptiActivityEnable() for tracing information, or call the Event/Metric APIs to profile GPU performance characteristics. For more control over the profiling session, you can use the CUPTI Callback API to register a callback in your code. The callback is invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver. Refer to the CUPTI samples callback_event and callback_metric for usage of the CUPTI Event and Metric APIs, respectively.