Get launch kernel response time by CUPTI

zhi_xz · May 4, 2023, 11:59am

Hi,

I want to get two metrics defined below.

Response time: This is the time duration between the start of a kernel launch on the CPU side using CUDA syntax, such as vectorAdd<<<>>>, and the beginning of the kernel execution on the GPU.
Execution time: This refers to the total time taken for the kernel to execute on the GPU.

I am using CUPTI APIs to obtain the metrics.

To retrieve the execution time of a kernel, I capture the CUpti_ActivityKernel4 activity and calculate the difference between the start and end timestamps to obtain the kernel’s execution time. Is this approach correct?

I’m not entirely sure how to retrieve the response time, but in my opinion, the vectorAdd<<<>>> function would call the driver API cuLaunchKernel, which can be captured using the CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel activity. Based on this, we could calculate the difference between the start of the cuLaunchKernel and the start of the CUpti_ActivityKernel4 activity to obtain the kernel’s response time. However, after testing my theory, it seems that the start timestamp of the CUpti_ActivityKernel4 activity is actually earlier than the start timestamp of cuLaunchKernel.

May I ask which specific operations or APIs are invoked when vectorAdd<<<>>> is called from the CPU side, and how could I obtain the response time of the kernel?

I eagerly anticipate your response, thanks!

zhi_xz · May 4, 2023, 12:05pm

it seems that the start timestamp of the CUpti_ActivityKernel4 activity is actually earlier than the start timestamp of cuLaunchKernel.

Considering that the start timestamp of the CUpti_ActivityKernel4 activity precedes that of cuLaunchKernel, would it be correct to infer that the overall duration of the CUpti_ActivityKernel4 activity involves more than just the execution time of the kernel on the GPU?

RahulDhoot · May 5, 2023, 9:40am

Hi zhi_xz,

You can use the below approaches to get the response time and execution time.

Execution time: Enable the activity CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL.
Get the difference between the start and end timestamps of the kernel record.

Response time: You can use either of these two approaches:

1. Enable the activities CUPTI_ACTIVITY_KIND_RUNTIME and CUPTI_ACTIVITY_KIND_DRIVER.
   Get the difference between the start timestamp of the kernel record and the start timestamp of the API activity record.
   Note that the ‘correlationId’ field should be used to correlate kernel and API activity records if there are multiple launches.

2. Enable the callback CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel.
   CUPTI will invoke this callback twice for the API, at the entry and exit points of the API.
   The two invocations can be distinguished by the ‘callbackSite’ parameter.
   Take the timestamp at the entry point. Use cuptiGetTimestamp to get the timestamp.
   Get the difference between this timestamp and the start timestamp of the kernel activity record.

I would recommend using option (1).
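Option (1) can be sketched roughly as below. This is only an illustration, not CUPTI code: ApiRecord, KernelRecord, LaunchTiming and computeTimings are hypothetical stand-ins that carry just the correlationId, start and end fields one would copy out of the real API and kernel activity records. It also assumes that, with both RUNTIME and DRIVER activities enabled, one launch may produce two API records sharing a correlationId, in which case the earliest start is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an API record (RUNTIME or DRIVER kind).
struct ApiRecord {
  uint32_t correlationId;
  uint64_t start;  // ns, start of the launch API call on the CPU
};

// Hypothetical stand-in for a CONCURRENT_KERNEL record.
struct KernelRecord {
  uint32_t correlationId;
  uint64_t start;  // ns, kernel begins executing on the GPU
  uint64_t end;    // ns, kernel finishes executing on the GPU
};

struct LaunchTiming {
  uint32_t correlationId;
  int64_t responseNs;    // kernel start minus API start
  uint64_t executionNs;  // kernel end minus kernel start
};

std::vector<LaunchTiming> computeTimings(const std::vector<ApiRecord>& apis,
                                         const std::vector<KernelRecord>& kernels) {
  // Index API start times by correlationId. If several API records share an
  // id (assumed: runtime call plus the driver call it makes), keep the earliest.
  std::unordered_map<uint32_t, uint64_t> apiStart;
  for (const ApiRecord& a : apis) {
    auto it = apiStart.find(a.correlationId);
    if (it == apiStart.end()) {
      apiStart.emplace(a.correlationId, a.start);
    } else {
      it->second = std::min(it->second, a.start);
    }
  }

  std::vector<LaunchTiming> out;
  for (const KernelRecord& k : kernels) {
    auto it = apiStart.find(k.correlationId);
    if (it == apiStart.end()) continue;  // no matching API record collected
    assert(k.end >= k.start);
    out.push_back({k.correlationId,
                   static_cast<int64_t>(k.start) - static_cast<int64_t>(it->second),
                   k.end - k.start});
  }
  return out;
}
```

The join is done after all records have been collected, so it does not matter in which order the API and kernel records were delivered. The response time is kept signed so that a clock or collection mistake shows up as a negative value instead of wrapping around.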

The kernel record only includes the execution time of the kernel on the GPU.
If you observed that the start timestamp of the kernel activity precedes that of cuLaunchKernel, it must be because either

You used a different clock source to get the timestamp in the CUPTI callback, or
You collected the timestamp at the exit point of the API.
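To illustrate the second cause, here is a minimal sketch of the callback approach that stores a timestamp only at the entry point. It does not call CUPTI: CallbackSite and CallbackInfo are hypothetical stand-ins for the real callback site and callback data, and nowNs stands in for a value that would be read with cuptiGetTimestamp inside the callback.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for the callback site and callback data.
enum class CallbackSite { Enter, Exit };

struct CallbackInfo {
  CallbackSite site;
  uint32_t correlationId;
};

using LaunchStartMap = std::unordered_map<uint32_t, uint64_t>;

// Called for both invocations of the launch callback. Only the entry
// invocation records a timestamp; the exit invocation is ignored.
void onLaunchCallback(const CallbackInfo& info, uint64_t nowNs,
                      LaunchStartMap& launchStart) {
  if (info.site != CallbackSite::Enter) return;
  launchStart[info.correlationId] = nowNs;
}

// Response time for one launch: kernel start minus the entry timestamp.
// Signed, so a wrong timestamp shows up as a negative value.
int64_t responseTimeNs(const LaunchStartMap& launchStart,
                       uint32_t correlationId, uint64_t kernelStartNs) {
  auto it = launchStart.find(correlationId);
  assert(it != launchStart.end());
  return static_cast<int64_t>(kernelStartNs) - static_cast<int64_t>(it->second);
}
```

If the exit invocation were recorded instead, the stored timestamp could be later than the kernel start whenever the kernel begins running before the launch call returns, which would make the kernel appear to start before its own launch.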

zhi_xz · May 5, 2023, 2:20pm

Thank you for your response.

I apologize for my mistake in stating that the start timestamp of the CUpti_ActivityKernel4 activity was earlier than the start timestamp of cuLaunchKernel.

After further investigation, I noticed that the activity record callback always comes before the cudaLaunchKernel API callback. Is this normal in your experience?

RahulDhoot · May 8, 2023, 11:45am

Hi zhi_xz,

When you say activity record callback, do you mean the ‘funcBufferCompleted’ callback?

zhi_xz · May 8, 2023, 3:43pm

void GpuProfiler::handleActivityRecord(CUpti_Activity* record) {
  ICHECK_NOTNULL(record);
  switch (record->kind) {
    case CUPTI_ACTIVITY_KIND_KERNEL:
    case CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL: {
      // handle kernel
      break;
    }
    case CUPTI_ACTIVITY_KIND_RUNTIME:{
      // handle CUPTI_ACTIVITY_KIND_RUNTIME
      break;
    }
    case CUPTI_ACTIVITY_KIND_DRIVER: {
      // handle CUPTI_ACTIVITY_KIND_DRIVER
      break;
    }
    default:
      LOG_E("Unsupported activity record kind: %d", record->kind);
  }
}

The code provided above serves as the handler for the funcBufferCompleted callback. What I would like to highlight is that the kernel execution record (kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) always arrives ahead of the API record for cudaLaunchKernel (kind CUPTI_ACTIVITY_KIND_DRIVER).
This seems highly unusual, since the kernel can only be executed through the cudaLaunchKernel API.

RahulDhoot · May 9, 2023, 4:43am

Hi zhi_xz,
CUPTI invokes the ‘funcBufferCompleted’ callback when the user buffer is completed.
CUPTI does not guarantee ordering of the activities in the buffer.
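Since the order of records is not guaranteed, a handler should not assume that the API record arrives before the kernel record. One possible way to cope, sketched below, is to hold each half until its partner with the same correlationId shows up. LaunchMatcher is a hypothetical helper, not part of CUPTI, and the sketch assumes exactly one API record per correlationId (for example, only one of the RUNTIME or DRIVER activities is used for matching).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper that pairs API and kernel start timestamps by
// correlationId, regardless of which record is delivered first.
class LaunchMatcher {
 public:
  // Each method returns true and fills *responseNs once both halves are known.
  bool onApiStart(uint32_t correlationId, uint64_t apiStartNs, int64_t* responseNs) {
    Pending& p = pending_[correlationId];
    p.apiStart = apiStartNs;
    p.hasApi = true;
    return resolve(correlationId, responseNs);
  }

  bool onKernelStart(uint32_t correlationId, uint64_t kernelStartNs, int64_t* responseNs) {
    Pending& p = pending_[correlationId];
    p.kernelStart = kernelStartNs;
    p.hasKernel = true;
    return resolve(correlationId, responseNs);
  }

  // Number of launches still waiting for their other half.
  std::size_t pendingCount() const { return pending_.size(); }

 private:
  struct Pending {
    uint64_t apiStart = 0;
    uint64_t kernelStart = 0;
    bool hasApi = false;
    bool hasKernel = false;
  };

  bool resolve(uint32_t correlationId, int64_t* responseNs) {
    assert(responseNs != nullptr);
    auto it = pending_.find(correlationId);
    if (it == pending_.end() || !it->second.hasApi || !it->second.hasKernel) {
      return false;  // still waiting for the other record
    }
    *responseNs = static_cast<int64_t>(it->second.kernelStart) -
                  static_cast<int64_t>(it->second.apiStart);
    pending_.erase(it);  // launch fully matched, drop its entry
    return true;
  }

  std::unordered_map<uint32_t, Pending> pending_;
};
```

In a handler like handleActivityRecord above, the kernel case would call onKernelStart and the API case would call onApiStart, each passing the record's correlationId and start timestamp. Entries left in the pending map after the final flush correspond to launches for which only one of the two records was collected.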

system · May 23, 2023, 4:43am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
