CUPTI Profiler API on a large program

The bundled examples profile only a few kernels. That is good for clarity, but if I want to profile an application that has many CUDA operators and kernels (hundreds to thousands), is it possible to use the CUPTI Profiling API to collect metrics from beginning to end? For example:

int main() {
  if (profile_enabled) {
    // set up and start metric collection here
  }
  application.execute(); // many kernels and lots of C++ and non-CUDA code.
  if (profile_enabled) {
    results = extract_metric();
  }
}

It’s similar to what Nsight Compute does, so I assume this is a reasonable use case for the Profiling API? Is there anything I need to be careful about when using it this way (e.g. data overflow, overhead)?

You can use the injection workflow for this. You need to create an injection library whose code will be similar to the callback_profiling sample in the CUPTI samples directory. Instead of subscribing to the callbacks in the main function, as the callback_profiling sample does, you subscribe to them inside a special function, “int InitializeInjection(void)”. To understand the injection workflow, refer to the cupti_finalize sample.

#include <iostream>
#include <cupti.h>

// CUPTI_API_CALL, ProfilingData_t and callbackHandler are defined as in the callback_profiling sample.
extern "C" int InitializeInjection(void)
{
    std::cout << "Starting injection..." << std::endl;
    ProfilingData_t* profilingData = new ProfilingData_t();
    // Set all the configuration for profiling: number of ranges, range and replay mode, counter data file names, etc.
    CUpti_SubscriberHandle subscriber;
    CUPTI_API_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)callbackHandler, profilingData));
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
    return 1; // non-zero return, as in the cupti_finalize sample
}

Set the following environment variable:

export CUDA_INJECTION64_PATH=<full_path to injection library>/

Then add the CUPTI library to LD_LIBRARY_PATH and run any CUDA app you want to profile.
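If you would rather set these variables from a small launcher program than from the shell, something like the following might work. This is only a sketch: the helper names prependPath and configureInjectionEnv are hypothetical (they are not part of CUPTI), and it assumes a POSIX system where setenv is available.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: prepend `dir` to a colon-separated search path.
// Avoids a dangling ':' when the existing value is unset or empty.
std::string prependPath(const std::string& dir, const char* existing)
{
    if (existing == nullptr || *existing == '\0') {
        return dir;
    }
    return dir + ":" + existing;
}

// Hypothetical helper: set CUDA_INJECTION64_PATH and extend LD_LIBRARY_PATH
// in the environment of the current process. Child processes started
// afterwards inherit both values. Returns true on success.
bool configureInjectionEnv(const std::string& injectionLib,
                           const std::string& cuptiLibDir)
{
    if (injectionLib.empty() || cuptiLibDir.empty()) {
        return false;
    }
    const std::string ldPath =
        prependPath(cuptiLibDir, std::getenv("LD_LIBRARY_PATH"));
    // setenv is POSIX; the final argument 1 means "overwrite if already set".
    if (setenv("CUDA_INJECTION64_PATH", injectionLib.c_str(), 1) != 0) {
        return false;
    }
    return setenv("LD_LIBRARY_PATH", ldPath.c_str(), 1) == 0;
}
```

Note that changing LD_LIBRARY_PATH this way only affects processes started afterwards, so a launcher would call configureInjectionEnv with its own paths and then start the CUDA application as a child process (or exec it).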

For profiling overhead, you can refer to the CUPTI documentation (CUPTI :: CUPTI Documentation).