How to get CUPTI metric values for a CUDA program with more than one kernel execution?

Dear CUDA CUPTI developers,

I have a question about using the CUPTI API. Please let me know if this is not the right forum for it.

Please forgive me for this verbose post.

I am studying the CUPTI API using the ‘callback_metric’ sample in the extras/CUPTI/sample directory. This sample shows how to use the callback and metric APIs together to record a metric’s events during the execution of a simple kernel, and then use those events to calculate the metric value.

I am using a Quadro K4000 GPU and the CUDA 6.0 toolkit for my experimental runs on Ubuntu Linux 14.04.

The restriction in this sample is that the CUDA program being profiled for CUPTI metrics and events must have just one kernel execution.

I would like to know if this example can be reused to get CUPTI metrics for a CUDA program with more than one kernel execution. For example, if I use the same approach (as in the sample callback_metric) to run the simpleCUFFT example or the radixsortThrust example, I get the following error:

error: too many events collected, metric expects only 2
error: too many events collected, metric expects only 2

Eventually I get other errors, and my application (which collects all the compute capability 3.x metrics) grinds to a halt. The problem appears to be that these programs execute more than one kernel, which violates the sample's single-kernel limitation.

One approach is to repeat the sample's procedure around each kernel execution, as follows:

// setup launch callback for event collection
// allocate space to hold all the events needed for the metric
// get the number of passes required to collect all the events
// needed for the metric and the event groups for each pass
execute_kernel_A(…);
// use all the collected events to calculate the metric value

// setup launch callback for event collection
// allocate space to hold all the events needed for the metric
// get the number of passes required to collect all the events
// needed for the metric and the event groups for each pass
execute_kernel_B(…);
// use all the collected events to calculate the metric value

But this approach gives me metric values for the individual kernel executions, not for the CUDA program as a whole. Also, it is incorrect to simply sum the metric values of all the kernel executions, because many metrics are ratios or rates rather than additive counts.
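To illustrate why summing per-kernel metric values is wrong, here is a minimal sketch in plain C++ (no CUPTI calls). It assumes a simplified ratio metric defined as instructions divided by cycles; that formula, the struct, and the function names are hypothetical and only for illustration, not CUPTI's actual metric definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical raw event counts for one kernel execution.
struct KernelEvents {
    std::uint64_t instructions;
    std::uint64_t cycles;
};

// Simplified ratio metric (an assumption, not a real CUPTI formula).
double ratioMetric(std::uint64_t instructions, std::uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Wrong for ratios: computes the metric per kernel and adds the results.
double sumOfPerKernelMetrics(const std::vector<KernelEvents>& kernels) {
    double sum = 0.0;
    for (const KernelEvents& k : kernels) {
        sum += ratioMetric(k.instructions, k.cycles);
    }
    return sum;
}

// Sums the raw event counts first, then computes the metric once.
double programLevelMetric(const std::vector<KernelEvents>& kernels) {
    std::uint64_t instructions = 0;
    std::uint64_t cycles = 0;
    for (const KernelEvents& k : kernels) {
        instructions += k.instructions;
        cycles += k.cycles;
    }
    return ratioMetric(instructions, cycles);
}
```

With made-up counts of {1000, 500} for kernel A and {300, 600} for kernel B, the per-kernel values are 2.0 and 0.5, so their sum is 2.5, while the value computed from the summed events is 1300/1100, roughly 1.18.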

Could you please let me know if there is a simple method to get these metric values for a CUDA program containing more than one kernel execution?
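One direction I am considering, sketched below under assumptions, is to accumulate the raw event values per event ID across all kernel launches instead of storing one slot per event as the sample does. The sketch is plain C++ with no CUPTI calls: EventId is a hypothetical stand-in for CUpti_EventID, and EventAccumulator and recordLaunch are names I made up. In a real tool, recordLaunch would be fed from the launch callback, and the summed values (together with the summed kernel durations) might then be passed once to cuptiMetricGetValue. Whether that gives a meaningful program-level value for every metric is something I have not verified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for CUpti_EventID, so the sketch builds without CUPTI.
using EventId = std::uint32_t;

// Keeps a running total per event ID across any number of kernel launches.
class EventAccumulator {
public:
    // Adds one event reading to the running total for that event.
    void add(EventId id, std::uint64_t value) { totals_[id] += value; }

    // Number of distinct events recorded so far.
    std::size_t distinctEvents() const { return totals_.size(); }

    // Running total for one event; 0 if the event was never recorded.
    std::uint64_t total(EventId id) const {
        const auto it = totals_.find(id);
        return it == totals_.end() ? 0 : it->second;
    }

private:
    std::map<EventId, std::uint64_t> totals_;
};

// Records the events read at the end of one kernel launch.
void recordLaunch(EventAccumulator& acc,
                  const std::vector<EventId>& ids,
                  const std::vector<std::uint64_t>& values) {
    assert(ids.size() == values.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        acc.add(ids[i], values[i]);
    }
}
```

Because the totals are keyed by event ID, a second or third launch adds to the existing entries rather than overrunning a fixed-size array, so the number of stored events stays equal to the number the metric expects.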

Best Regards
Ravi