Profiling CUDA Graph with the CUPTI Profiling API

I have a program that uses the TensorRT API to run inference on AI models with CUDA Graphs enabled. I would like to profile this program using the CUPTI API. CUPTI offers several kinds of profiling APIs. One of them, the CUPTI Activity API, is able to capture the kernels launched by TensorRT when CUDA Graphs are enabled. However, the CUPTI Profiling API does not capture any kernel launch events. The specific callbacks and domains I have enabled in my code are as follows.

  CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_DRIVER_API,
                                 CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
  CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED));
  CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED));
  CUPTI_CALL(cuptiEnableDomain(1, subscriber_, CUPTI_CB_DOMAIN_RUNTIME_API));

I'm wondering whether the CUPTI Profiling API does not support capturing kernels launched through CUDA Graphs, or whether I am using the API incorrectly.

Looking forward to your response.

Hi zhi_xz,

CUPTI supports tracing and profiling of kernels launched using CUDA Graph.

The CUPTI Activity APIs help collect tracing information, including start and end timestamps for CUDA and GPU activities.
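For reference, enabling kernel tracing through the Activity APIs looks roughly like this (bufferRequested/bufferCompleted stand for your buffer-management callbacks with the CUpti_BuffersCallbackRequestFunc/CUpti_BuffersCallbackCompleteFunc signatures, and a CUPTI_CALL-style error-checking macro is assumed):

// Register the buffer-management callbacks CUPTI uses to hand back activity records
CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));
// CONCURRENT_KERNEL records cover kernels launched directly as well as via CUDA Graph
CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL));
// ... run the workload ...
// Flush buffered records so bufferCompleted is invoked for any remaining data
CUPTI_CALL(cuptiActivityFlushAll(0));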

For collection of GPU performance metrics, one needs to use the CUPTI Profiling APIs.
Assuming you wish to collect profiling data for CUDA Graph launches: from your code block, I see you have only enabled the cuLaunchKernel callback. Along with that, you will also need to subscribe to other callbacks like *_cuLaunchKernel_ptsz, *_cuGraphLaunch, and *_cuGraphLaunch_ptsz. Refer to the profiling_injection sample for the injection-based profiling workflow for a CUDA application. In the sample, only the cuLaunchKernel callback is enabled for simplicity, but for your use case you can subscribe to the callbacks above and, inside the ProfilerCallbackHandler function, check whether the callback ID is one of them and do the profiling as per your needs.
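A minimal sketch of those subscriptions (assuming a subscriber handle named subscriber_ and a handler named ProfilerCallbackHandler, as in the sample):

void CUPTIAPI ProfilerCallbackHandler(void * userdata, CUpti_CallbackDomain domain,
                                      CUpti_CallbackId cbid, void const * cbdata)
{
    if (domain == CUPTI_CB_DOMAIN_DRIVER_API &&
        (cbid == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel ||
         cbid == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel_ptsz ||
         cbid == CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch ||
         cbid == CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch_ptsz))
    {
        // Start / advance / end the Profiler API session here, as in the sample
    }
}

// During initialization
CUpti_SubscriberHandle subscriber_;
CUPTI_CALL(cuptiSubscribe(&subscriber_, (CUpti_CallbackFunc)ProfilerCallbackHandler, NULL));
CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel_ptsz));
CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch));
CUPTI_CALL(cuptiEnableCallback(1, subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch_ptsz));

The _ptsz variants are the per-thread-default-stream forms of the launch APIs, so subscribing to both forms ensures no launch entry point is missed.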

Hi mjain,

Thank you for your response. I have tried the profiling_injection sample you mentioned, and I made some modifications to the sample code so that the CUDA kernels are captured into a CUDA graph. Here is the modified code:

// injection_2.cpp

// Copyright 2021 NVIDIA Corporation. All rights reserved
//
// This sample demonstrates using the profiler API in injection mode.
// Build this file as a shared object, and point environment variable
// LD_PRELOAD to the full path to the .so.
//
// This works by intercepting dlsym, so as long as any call in the
// target application uses dlsym internally, this should work.  Several
// CUDA runtime calls use dlsym, as well as other standard libraries, so
// this is a reasonable assumption.
//
// The intercepted dlsym is only used to call a one-time initialization
// routine which will register CUPTI Callback API functions, but otherwise
// calls into the libc dlsym() function as normal.
//
// After the initialization routine returns, the application resumes running,
// with the registered callbacks triggering as expected.  These callbacks
// are used to start a Profiler API session using Kernel Replay and
// Auto Range modes.
//
// A configurable number of kernel launches (default 10) are run
// under one session.  Before the 11th kernel launch, the callback
// ends the session, prints metrics, and starts a new session.
//
// An atexit callback is also used to ensure that any partial sessions
// are handled when the target application exits.
//
// This code supports multiple contexts and multithreading through
// locking shared data structures.

#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cupti_callbacks.h"
#include "cupti_profiler_target.h"
#include "cupti_driver_cbid.h"
#include "cupti_target.h"
#include "cupti_activity.h"
#include "nvperf_host.h"

#include <Eval.h>
using ::NV::Metric::Eval::PrintMetricValues;

#include <Metric.h>
using ::NV::Metric::Config::GetConfigImage;
using ::NV::Metric::Config::GetCounterDataPrefixImage;

#include <Utils.h>
using ::NV::Metric::Utils::GetNVPWResultString;

#include <iostream>
using ::std::cerr;
using ::std::cout;
using ::std::endl;

#include <mutex>
using ::std::mutex;

#include <string>
using ::std::string;

#include <unordered_map>
using ::std::unordered_map;

#include <unordered_set>
using ::std::unordered_set;

#include <vector>
using ::std::vector;

#include <cstdio>  // fprintf
#include <cstdlib> // getenv, atoi, atexit, exit
#include <cstring> // strtok

#include "dlfcn.h" // dlsym, RTLD_NEXT
extern "C"
{
    extern typeof(dlsym) __libc_dlsym;
    extern typeof(dlopen) __libc_dlopen_mode;
}

// Export InitializeInjection symbol
#ifdef _WIN32
#define DLLEXPORT __declspec(dllexport)
#define HIDDEN
#else
#define DLLEXPORT __attribute__((visibility("default")))
#define HIDDEN __attribute__((visibility("hidden")))
#endif

// Helpful error handlers for standard CUPTI and CUDA runtime calls
#define CUPTI_API_CALL(apiFuncCall)                                            \
do {                                                                           \
    CUptiResult _status = apiFuncCall;                                         \
    if (_status != CUPTI_SUCCESS) {                                            \
        const char *errstr;                                                    \
        cuptiGetResultString(_status, &errstr);                                \
        fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n",   \
                __FILE__, __LINE__, #apiFuncCall, errstr);                     \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

#define RUNTIME_API_CALL(apiFuncCall)                                          \
do {                                                                           \
    cudaError_t _status = apiFuncCall;                                         \
    if (_status != cudaSuccess) {                                              \
        fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n",   \
                __FILE__, __LINE__, #apiFuncCall, cudaGetErrorString(_status));\
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

#define MEMORY_ALLOCATION_CALL(var)                                            \
do {                                                                            \
    if (var == NULL) {                                                          \
        fprintf(stderr, "%s:%d: Error: Memory Allocation Failed \n",            \
                __FILE__, __LINE__);                                            \
        exit(EXIT_FAILURE);                                                     \
    }                                                                           \
} while (0)

#define DRIVER_API_CALL(apiFuncCall)                                           \
do {                                                                           \
    CUresult _status = apiFuncCall;                                            \
    if (_status != CUDA_SUCCESS) {                                             \
        fprintf(stderr, "%s:%d: error: function %s failed with error %d.\n",   \
                __FILE__, __LINE__, #apiFuncCall, _status);                    \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

#define NVPW_API_CALL(apiFuncCall)                                             \
do {                                                                           \
    NVPA_Status _status = apiFuncCall;                                         \
    if (_status != NVPA_STATUS_SUCCESS) {                                      \
        fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n",   \
            __FILE__, __LINE__, #apiFuncCall, GetNVPWResultString(_status));   \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

// Profiler API configuration data, per-context
struct ctxProfilerData
{
    CUcontext       ctx;
    int             dev_id;
    cudaDeviceProp  dev_prop;
    vector<uint8_t> counterAvailabilityImage;
    CUpti_Profiler_CounterDataImageOptions counterDataImageOptions;
    vector<uint8_t> counterDataImage;
    vector<uint8_t> counterDataPrefixImage;
    vector<uint8_t> counterDataScratchBufferImage;
    vector<uint8_t> configImage;
    int             maxNumRanges;
    int             curRanges;
    int             maxRangeNameLength;
    int             iterations; // Count of sessions

    // Initialize fields, with env var overrides
    ctxProfilerData() : curRanges(), maxRangeNameLength(64), iterations()
    {
        char * env_var = getenv("INJECTION_KERNEL_COUNT");
        if (env_var != NULL)
        {
            int val = atoi(env_var);
            if (val < 1)
            {
                cerr << "Read " << val << " kernels from INJECTION_KERNEL_COUNT, but must be >= 1; defaulting to 10." << endl;
                val = 10;
            }
            maxNumRanges = val;
        }
        else
        {
            maxNumRanges = 10;
        }
    };
};

// Track per-context profiler API data in a shared map
mutex ctx_data_mutex;
unordered_map<CUcontext, ctxProfilerData> ctx_data;

// List of metrics to collect
vector<string> metricNames;

// Initialize state
void initialize_state()
{
    static int profiler_initialized = 0;

    if (profiler_initialized == 0)
    {
        // CUPTI Profiler API initialization
        CUpti_Profiler_Initialize_Params profilerInitializeParams = { CUpti_Profiler_Initialize_Params_STRUCT_SIZE };
        CUPTI_API_CALL(cuptiProfilerInitialize(&profilerInitializeParams));

        // NVPW required initialization
        NVPW_InitializeHost_Params initializeHostParams = { NVPW_InitializeHost_Params_STRUCT_SIZE };
        NVPW_API_CALL(NVPW_InitializeHost(&initializeHostParams));

        profiler_initialized = 1;
    }
}

// Initialize profiler for a context
void initialize_ctx_data(ctxProfilerData &ctx_data)
{
    initialize_state();

    // Get size of counterAvailabilityImage - in the first pass, GetCounterAvailability returns the size needed for the data
    CUpti_Profiler_GetCounterAvailability_Params getCounterAvailabilityParams = { CUpti_Profiler_GetCounterAvailability_Params_STRUCT_SIZE };
    getCounterAvailabilityParams.ctx = ctx_data.ctx;
    CUPTI_API_CALL(cuptiProfilerGetCounterAvailability(&getCounterAvailabilityParams));

    // Allocate sized counterAvailabilityImage
    ctx_data.counterAvailabilityImage.resize(getCounterAvailabilityParams.counterAvailabilityImageSize);

    // Initialize counterAvailabilityImage
    getCounterAvailabilityParams.pCounterAvailabilityImage = ctx_data.counterAvailabilityImage.data();
    CUPTI_API_CALL(cuptiProfilerGetCounterAvailability(&getCounterAvailabilityParams));

    // Fill in configImage - can be run on host or target
    if (!GetConfigImage(ctx_data.dev_prop.name, metricNames, ctx_data.configImage, ctx_data.counterAvailabilityImage.data()))
    {
        cerr << "Failed to create configImage for context " << ctx_data.ctx << endl;
        exit(EXIT_FAILURE);
    }

    // Fill in counterDataPrefixImage - can be run on host or target
    if (!GetCounterDataPrefixImage(ctx_data.dev_prop.name, metricNames, ctx_data.counterDataPrefixImage, ctx_data.counterAvailabilityImage.data()))
    {
        cerr << "Failed to create counterDataPrefixImage for context " << ctx_data.ctx << endl;
        exit(EXIT_FAILURE);
    }

    // Record counterDataPrefixImage info and other options for sizing the counterDataImage
    ctx_data.counterDataImageOptions.pCounterDataPrefix = ctx_data.counterDataPrefixImage.data();
    ctx_data.counterDataImageOptions.counterDataPrefixSize = ctx_data.counterDataPrefixImage.size();
    ctx_data.counterDataImageOptions.maxNumRanges = ctx_data.maxNumRanges;
    ctx_data.counterDataImageOptions.maxNumRangeTreeNodes = ctx_data.maxNumRanges;
    ctx_data.counterDataImageOptions.maxRangeNameLength = ctx_data.maxRangeNameLength;

    // Calculate size of counterDataImage based on counterDataPrefixImage and options
    CUpti_Profiler_CounterDataImage_CalculateSize_Params calculateSizeParams = { CUpti_Profiler_CounterDataImage_CalculateSize_Params_STRUCT_SIZE };
    calculateSizeParams.pOptions = &(ctx_data.counterDataImageOptions);
    calculateSizeParams.sizeofCounterDataImageOptions = CUpti_Profiler_CounterDataImageOptions_STRUCT_SIZE;
    CUPTI_API_CALL(cuptiProfilerCounterDataImageCalculateSize(&calculateSizeParams));
    // Create counterDataImage
    ctx_data.counterDataImage.resize(calculateSizeParams.counterDataImageSize);

    // Initialize counterDataImage
    CUpti_Profiler_CounterDataImage_Initialize_Params initializeParams = { CUpti_Profiler_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initializeParams.pOptions = &(ctx_data.counterDataImageOptions);
    initializeParams.sizeofCounterDataImageOptions = CUpti_Profiler_CounterDataImageOptions_STRUCT_SIZE;
    initializeParams.counterDataImageSize = ctx_data.counterDataImage.size();
    initializeParams.pCounterDataImage = ctx_data.counterDataImage.data();
    CUPTI_API_CALL(cuptiProfilerCounterDataImageInitialize(&initializeParams));

    // Calculate scratchBuffer size based on counterDataImage size and counterDataImage
    CUpti_Profiler_CounterDataImage_CalculateScratchBufferSize_Params scratchBufferSizeParams = { CUpti_Profiler_CounterDataImage_CalculateScratchBufferSize_Params_STRUCT_SIZE };
    scratchBufferSizeParams.counterDataImageSize = ctx_data.counterDataImage.size();
    scratchBufferSizeParams.pCounterDataImage = ctx_data.counterDataImage.data();
    CUPTI_API_CALL(cuptiProfilerCounterDataImageCalculateScratchBufferSize(&scratchBufferSizeParams));
    // Create counterDataScratchBuffer
    ctx_data.counterDataScratchBufferImage.resize(scratchBufferSizeParams.counterDataScratchBufferSize);

    // Initialize counterDataScratchBuffer
    CUpti_Profiler_CounterDataImage_InitializeScratchBuffer_Params initScratchBufferParams = { CUpti_Profiler_CounterDataImage_InitializeScratchBuffer_Params_STRUCT_SIZE };
    initScratchBufferParams.counterDataImageSize = ctx_data.counterDataImage.size();
    initScratchBufferParams.pCounterDataImage = ctx_data.counterDataImage.data();
    initScratchBufferParams.counterDataScratchBufferSize = ctx_data.counterDataScratchBufferImage.size();
    initScratchBufferParams.pCounterDataScratchBuffer = ctx_data.counterDataScratchBufferImage.data();
    CUPTI_API_CALL(cuptiProfilerCounterDataImageInitializeScratchBuffer(&initScratchBufferParams));

}

// Start a session
void start_session(ctxProfilerData &ctx_data)
{
    CUpti_Profiler_BeginSession_Params beginSessionParams = { CUpti_Profiler_BeginSession_Params_STRUCT_SIZE };
    beginSessionParams.counterDataImageSize = ctx_data.counterDataImage.size();
    beginSessionParams.pCounterDataImage = ctx_data.counterDataImage.data();
    beginSessionParams.counterDataScratchBufferSize = ctx_data.counterDataScratchBufferImage.size();
    beginSessionParams.pCounterDataScratchBuffer = ctx_data.counterDataScratchBufferImage.data();
    beginSessionParams.ctx = ctx_data.ctx;
    beginSessionParams.maxLaunchesPerPass = ctx_data.maxNumRanges;
    beginSessionParams.maxRangesPerPass = ctx_data.maxNumRanges;
    beginSessionParams.pPriv = NULL;
    beginSessionParams.range = CUPTI_AutoRange;
    beginSessionParams.replayMode = CUPTI_KernelReplay;
    CUPTI_API_CALL(cuptiProfilerBeginSession(&beginSessionParams));

    CUpti_Profiler_SetConfig_Params setConfigParams = { CUpti_Profiler_SetConfig_Params_STRUCT_SIZE };
    setConfigParams.pConfig = ctx_data.configImage.data();
    setConfigParams.configSize = ctx_data.configImage.size();
    setConfigParams.passIndex = 0; // Only set for Application Replay mode
    setConfigParams.minNestingLevel = 1;
    setConfigParams.numNestingLevels = 1;
    setConfigParams.targetNestingLevel = 1;
    CUPTI_API_CALL(cuptiProfilerSetConfig(&setConfigParams));

    CUpti_Profiler_EnableProfiling_Params enableProfilingParams = { CUpti_Profiler_EnableProfiling_Params_STRUCT_SIZE };
    enableProfilingParams.ctx = ctx_data.ctx;
    CUPTI_API_CALL(cuptiProfilerEnableProfiling(&enableProfilingParams));

    ctx_data.iterations++;
}

// Print session data
static void print_data(ctxProfilerData &ctx_data)
{
    cout << endl << "Context " << ctx_data.ctx << ", device " << ctx_data.dev_id << " (" << ctx_data.dev_prop.name << ") session " << ctx_data.iterations << ":" << endl;
    PrintMetricValues(ctx_data.dev_prop.name, ctx_data.counterDataImage, metricNames, ctx_data.counterAvailabilityImage.data());
}

// End a session during execution
void end_session(ctxProfilerData &ctx_data)
{
    CUpti_Profiler_DisableProfiling_Params disableProfilingParams = { CUpti_Profiler_DisableProfiling_Params_STRUCT_SIZE };
    disableProfilingParams.ctx = ctx_data.ctx;
    CUPTI_API_CALL(cuptiProfilerDisableProfiling(&disableProfilingParams));

    CUpti_Profiler_UnsetConfig_Params unsetConfigParams = { CUpti_Profiler_UnsetConfig_Params_STRUCT_SIZE };
    unsetConfigParams.ctx = ctx_data.ctx;
    CUPTI_API_CALL(cuptiProfilerUnsetConfig(&unsetConfigParams));

    CUpti_Profiler_EndSession_Params endSessionParams = { CUpti_Profiler_EndSession_Params_STRUCT_SIZE };
    endSessionParams.ctx = ctx_data.ctx;
    CUPTI_API_CALL(cuptiProfilerEndSession(&endSessionParams));

    print_data(ctx_data);

    // Clear counterDataImage (otherwise it maintains previous records when it is reused)
    CUpti_Profiler_CounterDataImage_Initialize_Params initializeParams = { CUpti_Profiler_CounterDataImage_Initialize_Params_STRUCT_SIZE };
    initializeParams.pOptions = &(ctx_data.counterDataImageOptions);
    initializeParams.sizeofCounterDataImageOptions = CUpti_Profiler_CounterDataImageOptions_STRUCT_SIZE;
    initializeParams.counterDataImageSize = ctx_data.counterDataImage.size();
    initializeParams.pCounterDataImage = ctx_data.counterDataImage.data();
    CUPTI_API_CALL(cuptiProfilerCounterDataImageInitialize(&initializeParams));
}

// Clean up at end of execution
static void end_execution()
{
    CUPTI_API_CALL(cuptiGetLastError());
    ctx_data_mutex.lock();

    for (auto itr = ctx_data.begin(); itr != ctx_data.end(); ++itr)
    {
        ctxProfilerData &data = itr->second;

        if (data.curRanges > 0)
        {
            print_data(data);
            data.curRanges = 0;
        }
    }

    ctx_data_mutex.unlock();
}

// Callback handler
void callback(void * userdata, CUpti_CallbackDomain domain, CUpti_CallbackId cbid, void const * cbdata)
{
    static int initialized = 0;

    CUptiResult res;
    if (domain == CUPTI_CB_DOMAIN_DRIVER_API)
    {
        // For a driver call to launch a kernel:
        if (cbid == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel)
        {
            CUpti_CallbackData const * data = static_cast<CUpti_CallbackData const *>(cbdata);
            CUcontext ctx = data->context;

            // On entry, enable / update profiling as needed
            if (data->callbackSite == CUPTI_API_ENTER)
            {
                // Check for this context in the configured contexts
                // If not configured, it isn't compatible with profiling
                ctx_data_mutex.lock();
                if (ctx_data.count(ctx) > 0)
                {
                    // If at maximum number of ranges, end session and reset
                    if (ctx_data[ctx].curRanges == ctx_data[ctx].maxNumRanges)
                    {
                        end_session(ctx_data[ctx]);
                        ctx_data[ctx].curRanges = 0;
                    }

                    // If no currently enabled session on this context, start one
                    if (ctx_data[ctx].curRanges == 0)
                    {
                        initialize_ctx_data(ctx_data[ctx]);
                        start_session(ctx_data[ctx]);
                    }

                    // Increment curRanges
                    ctx_data[ctx].curRanges++;
                }
                ctx_data_mutex.unlock();
            }
        }
    }
    else if (domain == CUPTI_CB_DOMAIN_RESOURCE)
    {
        // When a context is created, check to see whether the device is compatible with the Profiler API
        if (cbid == CUPTI_CBID_RESOURCE_CONTEXT_CREATED)
        {
            CUpti_ResourceData const * res_data = static_cast<CUpti_ResourceData const *>(cbdata);
            CUcontext ctx = res_data->context;

            // Configure handler for new context under lock
            ctxProfilerData data = { };

            data.ctx = ctx;

            RUNTIME_API_CALL(cudaGetDevice(&(data.dev_id)));

            // Initialize profiler API and test device compatibility
            initialize_state();
            CUpti_Profiler_DeviceSupported_Params params = { CUpti_Profiler_DeviceSupported_Params_STRUCT_SIZE };
            params.cuDevice = data.dev_id;
            CUPTI_API_CALL(cuptiProfilerDeviceSupported(&params));

            // If valid for profiling, set up profiler and save to shared structure
            ctx_data_mutex.lock();
            if (params.isSupported == CUPTI_PROFILER_CONFIGURATION_SUPPORTED)
            {
                // Update shared structures
                ctx_data[ctx] = data;
                initialize_ctx_data(ctx_data[ctx]);
            }
            else
            {
                if (ctx_data.count(ctx))
                {
                    // Update shared structures
                    ctx_data.erase(ctx);
                }

                cerr << "libinjection_2: Unable to profile context on device " << data.dev_id << endl;

                if (params.architecture == CUPTI_PROFILER_CONFIGURATION_UNSUPPORTED)
                {
                    cerr << "\tdevice architecture is not supported" << endl;
                }

                if (params.sli == CUPTI_PROFILER_CONFIGURATION_UNSUPPORTED)
                {
                    cerr << "\tdevice sli configuration is not supported" << endl;
                }

                if (params.vGpu == CUPTI_PROFILER_CONFIGURATION_UNSUPPORTED)
                {
                    cerr << "\tdevice vgpu configuration is not supported" << endl;
                }
                else if (params.vGpu == CUPTI_PROFILER_CONFIGURATION_DISABLED)
                {
                    cerr << "\tdevice vgpu configuration disabled profiling support" << endl;
                }

                if (params.confidentialCompute == CUPTI_PROFILER_CONFIGURATION_UNSUPPORTED)
                {
                    cerr << "\tdevice confidential compute configuration is not supported" << endl;
                }

                if (params.cmp == CUPTI_PROFILER_CONFIGURATION_UNSUPPORTED)
                {
                    ::std::cerr << "\tNVIDIA Crypto Mining Processors (CMP) are not supported" << ::std::endl;
                }
            }
            ctx_data_mutex.unlock();
        }
    }

    return;
}

// Register callbacks for several points in target application execution
void register_callbacks()
{
    // One subscriber is used to register multiple callback domains
    CUpti_SubscriberHandle subscriber;
    CUPTI_API_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)callback, NULL));
    // Driver callback domain is needed for kernel and graph launch callbacks
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel));
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel_ptsz));
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch));
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API, CUPTI_DRIVER_TRACE_CBID_cuGraphLaunch_ptsz));
    // Resource callback domain is needed for context creation callbacks
    CUPTI_API_CALL(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_CONTEXT_CREATED));
    // Register callback for application exit
    atexit(end_execution);
}

static bool injectionInitialized = false;

// InitializeInjection should be called before the first CUDA function in the
// target application.  It cannot call any CUDA runtime or driver code, but
// the CUPTI Callback API is supported at this point.
extern "C" DLLEXPORT int InitializeInjection()
{
    if (injectionInitialized == false)
    {
        injectionInitialized = true;

        // Read in optional list of metrics to gather
        char * metrics_env = getenv("INJECTION_METRICS");
        if (metrics_env != NULL)
        {
            char * tok = strtok(metrics_env, " ;,");
            while (tok != NULL)
            {
                cout << "Read " << tok << endl;
                metricNames.push_back(string(tok));
                tok = strtok(NULL, " ;,");
            }
        }
        else
        {
            metricNames.push_back("sm__cycles_elapsed.avg");
            metricNames.push_back("smsp__sass_thread_inst_executed_op_dadd_pred_on.avg");
            metricNames.push_back("smsp__sass_thread_inst_executed_op_dfma_pred_on.avg");
        }

        // Subscribe to some callbacks
        register_callbacks();
    }
    return 1;
}

// Whether the application calls the runtime or driver CUDA API, dynamic
// linking will likely use dlsym - intercept this call with LD_PRELOAD to
// have a convenient place to initialize Cupti Callback API.
extern "C" DLLEXPORT void * dlsym(void * handle, char const * symbol)
{
    InitializeInjection();

    typedef void * (*dlsym_fn)(void *, char const *);
    static dlsym_fn real_dlsym = NULL;
    if (real_dlsym == NULL)
    {
        // Use libc internal names to avoid recursive call
        real_dlsym = (dlsym_fn)(__libc_dlsym(__libc_dlopen_mode("libdl.so", RTLD_LAZY), "dlsym"));
    }
    if (real_dlsym == NULL)
    {
        cerr << "Error finding real dlsym symbol" << endl;
        return NULL;
    }
    return real_dlsym(handle, symbol);
}

// complex_target.cu

// Copyright 2021 NVIDIA Corporation. All rights reserved
//
// This is a sample CUDA application with several different kernel launch
// patterns - launching on the default stream, multiple streams, and multiple
// threads on different devices, if more than one device is present.
//
// The injection sample shared library can be used on this sample application,
// demonstrating that the injection code handles multiple streams and multiple
// threads.

// Standard CUDA headers
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "driver_types.h"

// Standard STL headers
#include <chrono>
#include <cstdint>
#include <iostream>
using ::std::cerr;
using ::std::cout;
using ::std::endl;

#include <string>
using ::std::string;

#include <thread>
using ::std::thread;

#include <vector>
using ::std::vector;

// Helpful error handlers for standard CUDA runtime calls
#define RUNTIME_API_CALL(apiFuncCall)                                          \
do {                                                                           \
    cudaError_t _status = apiFuncCall;                                         \
    if (_status != cudaSuccess) {                                              \
        fprintf(stderr, "%s:%d: error: function %s failed with error %s.\n",   \
                __FILE__, __LINE__, #apiFuncCall, cudaGetErrorString(_status));\
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

#define MEMORY_ALLOCATION_CALL(var)                                             \
do {                                                                            \
    if (var == NULL) {                                                          \
        fprintf(stderr, "%s:%d: Error: Memory Allocation Failed \n",            \
                __FILE__, __LINE__);                                            \
        exit(EXIT_FAILURE);                                                     \
    }                                                                           \
} while (0)

#define DRIVER_API_CALL(apiFuncCall)                                           \
do {                                                                           \
    CUresult _status = apiFuncCall;                                            \
    if (_status != CUDA_SUCCESS) {                                             \
        fprintf(stderr, "%s:%d: error: function %s failed with error %d.\n",   \
                __FILE__, __LINE__, #apiFuncCall, _status);                    \
        exit(EXIT_FAILURE);                                                    \
    }                                                                          \
} while (0)

// Per-device configuration, buffers, stream and device information, and device pointers
typedef struct {
    int deviceID;
    CUcontext context;         //!< CUDA driver context, or NULL if default context has already been initialized
    vector<cudaStream_t> streams;           // Each device needs its own streams
    vector<double *> d_x;                   // And device memory allocation
    vector<double *> d_y;                   // ..
} perDeviceData;

#define DAXPY_REPEAT 32768
// Loop over array of elements performing daxpy multiple times
// To be launched with only one block (artificially increasing serial time to better demonstrate overlapping replay)
__global__ void daxpyKernel(int elements, double a, double * x, double * y)
{
    for (int i = threadIdx.x; i < elements; i += blockDim.x)
        // Artificially increase kernel runtime to emphasize concurrency
        for (int j = 0; j < DAXPY_REPEAT; j++)
            y[i] = a * x[i] + y[i]; // daxpy
}

// Initialize kernel values
double a = 2.5;

// Normally you would want multiple warps, but to emphasize concurrency with streams and multiple devices
// we run the kernels on a single warp.
int threadsPerBlock = 32;
int threadBlocks = 1;

// Configurable number of kernels (streams, when running concurrently)
int const numKernels = 4;
int const numStreams = numKernels;
vector<size_t> elements(numKernels);

// Each kernel call allocates and computes (call number) * (blockSize) elements
// For 4 calls, this is 4k elements * 2 arrays * (1 + 2 + 3 + 4 stream mul) * 8B/elem =~ 640KB
int const blockSize = 4 * 1024;

// Wrapper which will launch numKernels kernel calls on a single device
// The device streams vector is used to control which stream each call is made on
// If 'serial' is true (and CUDA Graph is not used), the device streams are ignored and the default stream is used instead
void launchKernels(perDeviceData &d, bool use_cuda_graph, bool serial)
{
    // Switch to desired device
    RUNTIME_API_CALL(cudaSetDevice(d.deviceID));
    DRIVER_API_CALL(cuCtxSetCurrent(d.context));
    bool is_graph_created_ = false;
    cudaGraphExec_t graph_exec_ = nullptr;
    for (unsigned int stream = 0; stream < d.streams.size(); stream++)
    {
        cudaStream_t streamId = ((serial && !use_cuda_graph) ? 0 : d.streams[stream]);
        if (use_cuda_graph) {
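            // Capture the kernel launch into a graph once, instantiate an executable
            // graph from it, then replay the executable graph on each iteration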
            for (int i = 0; i < 3; i++) {
                if (!is_graph_created_) {
                    cudaGraph_t cuda_graph = nullptr;
                    RUNTIME_API_CALL(cudaStreamBeginCapture(streamId, cudaStreamCaptureModeRelaxed));
                    daxpyKernel <<<threadBlocks, threadsPerBlock, 0, streamId>>> (elements[stream], a, d.d_x[stream], d.d_y[stream]);
                    cudaError_t err = cudaGetLastError();
                    if (err != cudaSuccess) {
                        printf("Error after daxpyKernel: %s\n", cudaGetErrorString(err));
                    }
                    RUNTIME_API_CALL(cudaStreamEndCapture(streamId, &cuda_graph));
                    RUNTIME_API_CALL(cudaGraphInstantiate(&graph_exec_, cuda_graph, nullptr, nullptr, 0));
                    RUNTIME_API_CALL(cudaGraphDestroy(cuda_graph));
                    is_graph_created_ = true;
                }
                RUNTIME_API_CALL(cudaGraphLaunch(graph_exec_, streamId));
            }
            RUNTIME_API_CALL(cudaGetLastError());
        } else {
            for (int i = 0; i < 3; i++) {
                daxpyKernel <<<threadBlocks, threadsPerBlock, 0, streamId>>> (elements[stream], a, d.d_x[stream], d.d_y[stream]);
            }
        }
    }

    // After launching all work, synchronize all streams
    if (serial == false)
    {
        for (unsigned int stream = 0; stream < d.streams.size(); stream++)
        {
            RUNTIME_API_CALL(cudaStreamSynchronize(d.streams[stream]));
        }
    }
    else
    {
        RUNTIME_API_CALL(cudaStreamSynchronize(0));
    }
}


int main(int argc, char * argv[])
{
    int numDevices;
    RUNTIME_API_CALL(cudaGetDeviceCount(&numDevices));
    bool use_cuda_graph_ = false;
    if (argc > 1) {
        char * use_cuda_graph = argv[1];
        printf("use_cuda_graph: %s\n", use_cuda_graph);
        use_cuda_graph_ = (std::stoi(use_cuda_graph) != 0);
    }
    // Per-device information
    vector<int> device_ids;

    // Find all devices
    for (int i = 0; i < numDevices; i++)
    {
        // Record device number
        device_ids.push_back(i);
    }

    numDevices = device_ids.size();
    cout << "Found " << numDevices << " devices" << endl;

    // Ensure we found at least one device
    if (numDevices == 0)
    {
        cerr << "No devices detected" << endl;
        exit(-1);
    }

    // Initialize kernel input to some known numbers
    vector<double> h_x(blockSize * numKernels);
    vector<double> h_y(blockSize * numKernels);
    for (size_t i = 0; i < blockSize * numKernels; i++)
    {
        h_x[i] = 1.5 * i;
        h_y[i] = 2.0 * (i - 3000);
    }

    // Initialize a vector of 'default stream' values to demonstrate serialized kernels
    vector<cudaStream_t> defaultStreams(numStreams);
    for (int stream = 0; stream < numStreams; stream++)
    {
        defaultStreams[stream] = 0;
    }

    // Scale per-kernel work by stream number
    for (int stream = 0; stream < numStreams; stream++)
    {
        elements[stream] = blockSize * (stream + 1);
    }

    // For each device, configure profiling, set up buffers, copy kernel data
    vector<perDeviceData> deviceData(numDevices);

    for (int device = 0; device < numDevices; device++)
    {
        RUNTIME_API_CALL(cudaSetDevice(device_ids[device]));
        cout << "Configuring device " << device_ids[device] << endl;

        // For simplicity's sake, in this sample, a single config struct is created per device
        deviceData[device].deviceID = device_ids[device];// GPU device ID

        DRIVER_API_CALL(cuCtxCreate(&(deviceData[device].context), 0, device_ids[device])); // Either set to a context, or may be NULL if a default context has been created

        // Per-stream initialization & memory allocation - copy from constant host array to each device array
        deviceData[device].streams.resize(numStreams);
        deviceData[device].d_x.resize(numStreams);
        deviceData[device].d_y.resize(numStreams);
        for (int stream = 0; stream < numStreams; stream++)
        {
            RUNTIME_API_CALL(cudaStreamCreate(&(deviceData[device].streams[stream])));

            // Each kernel does (stream #) * blockSize work on doubles
            size_t size = elements[stream] * sizeof(double);

            RUNTIME_API_CALL(cudaMalloc(&(deviceData[device].d_x[stream]), size));
            MEMORY_ALLOCATION_CALL(deviceData[device].d_x[stream]); // Validate pointer
            RUNTIME_API_CALL(cudaMemcpy(deviceData[device].d_x[stream], h_x.data(), size, cudaMemcpyHostToDevice));

            RUNTIME_API_CALL(cudaMalloc(&(deviceData[device].d_y[stream]), size));
            MEMORY_ALLOCATION_CALL(deviceData[device].d_y[stream]); // Validate pointer
            RUNTIME_API_CALL(cudaMemcpy(deviceData[device].d_y[stream], h_y.data(), size, cudaMemcpyHostToDevice));
        }
    }

    //
    // First version - single device, kernel calls serialized on default stream
    //

    // Use wallclock time to measure performance
    auto begin_time = ::std::chrono::high_resolution_clock::now();

    // Run on first device and use default streams - will show runtime without any concurrency
    launchKernels(deviceData[0], use_cuda_graph_, true);

    auto end_time = ::std::chrono::high_resolution_clock::now();
    auto elapsed_serial_ms = ::std::chrono::duration_cast<::std::chrono::milliseconds>(end_time - begin_time);
    cout << "It took " << elapsed_serial_ms.count() << "ms on the host to launch " << numKernels << " kernels in serial" << endl; 

    //
    // Second version - same kernel calls as before on the same device, but now using separate streams for concurrency
    // (Should be limited by the longest running kernel)
    //

    begin_time = ::std::chrono::high_resolution_clock::now();

    // Still only use first device, but this time use its allocated streams for parallelism
    launchKernels(deviceData[0], use_cuda_graph_, false);

    end_time = ::std::chrono::high_resolution_clock::now();
    auto elapsed_single_device_ms = ::std::chrono::duration_cast<::std::chrono::milliseconds>(end_time - begin_time);
    cout << "It took " << elapsed_single_device_ms.count() << "ms on the host to launch " << numKernels << " kernels on a single device on separate streams" << endl;

    //
    // Third version - same as the second case, but duplicate the work across devices to show cross-device concurrency
    // This is done using threads so no serialization is needed between devices
    // (Should have roughly the same runtime as second case)
    //

    // Time creation of the same multiple streams * multiple devices
    vector<::std::thread> threads;
    begin_time = ::std::chrono::high_resolution_clock::now();

    // Now launch parallel thread work, duplicated on one thread per gpu
    for (int device = 0; device < numDevices; device++)
    {
        threads.push_back(::std::thread(launchKernels, ::std::ref(deviceData[device]), use_cuda_graph_, false));
    }

    // Wait for all threads to finish
    for (auto &t: threads)
    {
        t.join();
    }

    // Record time used when launching on multiple devices
    end_time = ::std::chrono::high_resolution_clock::now();
    auto elapsed_multiple_device_ms = ::std::chrono::duration_cast<::std::chrono::milliseconds>(end_time - begin_time);
    cout << "It took " << elapsed_multiple_device_ms.count() << "ms on the host to launch the same " << numKernels << " kernels on each of the " << numDevices << " devices in parallel" << endl;

    // Free stream memory for each device
    for (int i = 0; i < numDevices; i++)
    {
        for (int j = 0; j < numKernels; j++)
        {
            RUNTIME_API_CALL(cudaFree(deviceData[i].d_x[j]));
            RUNTIME_API_CALL(cudaFree(deviceData[i].d_y[j]));
        }
    }

    return 0;
}

I have included an option that lets you choose whether or not to use CUDA Graph: the value 0 runs without a CUDA graph, while 1 runs with one.

However, there seems to be an issue with one of the test cases. Enabling CUDA Graph while using libinjection_2.so causes an error: in complex_target.cu, at line 123, the call to cudaStreamEndCapture(streamId, &cuda_graph) fails due to a previous error during capture.

To run this particular test case, you can use the following command:

sudo LD_PRELOAD=./libinjection_2.so LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$(pwd) ./complex_target 1

Could you please assist me in identifying the cause of this error?