Repeating kernel launches for CUPTI events

First, why does nvprof need to relaunch the kernel to collect different metrics and events? This causes a large performance drop, since you can reach a point where every event or metric requires a new kernel launch.

Second, how can I repeat or relaunch the kernel for multiple events in my own CUPTI code? If I add multiple event groups, I can't enable them all for the same context. Should I create a new context for every event group? Is there a sample or example that does this with CUPTI?

NVIDIA GPU hardware has a limited number of counter registers, so we cannot collect all possible counters concurrently. There are also limitations on which events/metrics can be collected together in a single pass. Thus we may have to replay the same kernel multiple times to collect all the requested events/metrics.
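To make the counter limit concrete, here is a toy model of how the number of replays grows with the number of requested events. It is only a sketch under stated assumptions: the function name passes_needed and the counters_per_pass parameter are hypothetical, each event is assumed to occupy exactly one counter register, and the compatibility restrictions between events mentioned above are ignored. Real CUPTI scheduling is more involved.

```c
#include <assert.h>

/* Toy model, not CUPTI's actual scheduler.
 * Assumption: every event needs one counter register, and at most
 * counters_per_pass events can be collected in one run of the kernel.
 * Returns how many times the kernel would have to run. */
static int passes_needed(int num_events, int counters_per_pass)
{
    assert(counters_per_pass > 0);
    if (num_events <= 0) {
        return 0;  /* nothing requested, nothing to replay */
    }
    /* Ceiling division: a partially filled last pass still costs a replay. */
    return (num_events + counters_per_pass - 1) / counters_per_pass;
}
```

With a hypothetical limit of 4 counters, 4 events fit in one pass, while 5 events already need two, which is why requesting many events or metrics multiplies the kernel run time.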

You can use the API cuptiEnableKernelReplayMode() to enable kernel replay for a context. In this mode you can enable any number of event groups, and all the events they contain will be collected. There is no need to create a separate context for every event group.

Code snippet:

/* Let CUPTI replay each kernel as often as needed to cover all enabled groups. */
cuptiEnableKernelReplayMode(ctx);
/* metricArraySize is the size of metricArray in bytes. */
cuptiMetricCreateEventGroupSets(ctx, metricArraySize, metricArray, &eventGroupSetArray);
/* Enable every event group of every set up front, on the same context. */
for (i = 0; i < eventGroupSetArray->numSets; i++) {
  for (j = 0; j < eventGroupSetArray->sets[i].numEventGroups; j++) {
    cuptiEventGroupEnable(eventGroupSetArray->sets[i].eventGroups[j]);
  }
}
/* A single launch in the source; CUPTI replays it internally. */
kernel<<<1, 1>>>();
cudaDeviceSynchronize();
/* Read the collected values, then disable each group. */
for (i = 0; i < eventGroupSetArray->numSets; i++) {
  for (j = 0; j < eventGroupSetArray->sets[i].numEventGroups; j++) {
    cuptiEventGroupReadAllEvents(eventGroupSetArray->sets[i].eventGroups[j], ...);
    cuptiEventGroupDisable(eventGroupSetArray->sets[i].eventGroups[j]);
  }
}
cuptiEventGroupSetsDestroy(eventGroupSetArray);