Get kernel launch response time with CUPTI

Hi,

I want to get two metrics defined below.

  • Response time: This is the time duration between the start of a kernel launch on the CPU side using CUDA syntax, such as vectorAdd<<<>>>, and the beginning of the kernel execution on the GPU.
  • Execution time: This refers to the total time taken for the kernel to execute on the GPU.

I am using CUPTI APIs to obtain the metrics.

To retrieve the execution time of a kernel, I capture the CUpti_ActivityKernel4 activity and calculate the difference between the start and end timestamps to obtain the kernel’s execution time. Is this approach correct?

I’m not entirely sure how to retrieve the response time. My understanding is that vectorAdd<<<>>> ends up calling the driver API cuLaunchKernel, which can be captured using the CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel callback. Based on this, we could calculate the difference between the start of cuLaunchKernel and the start of the CUpti_ActivityKernel4 activity to obtain the kernel’s response time. However, when I tested this, the start timestamp of the CUpti_ActivityKernel4 activity appeared to be earlier than the start timestamp of cuLaunchKernel.

May I ask which specific operations or APIs are invoked when vectorAdd<<<>>> is called from the CPU side, and how could I obtain the response time of the kernel?

I eagerly anticipate your response, thanks!

it seems that the start timestamp of the CUpti_ActivityKernel4 activity is actually earlier than the start timestamp of cuLaunchKernel.

Considering that the start timestamp of the CUpti_ActivityKernel4 activity precedes that of cuLaunchKernel, would it be correct to infer that the overall duration of the CUpti_ActivityKernel4 activity involves more than just the execution time of the kernel on the GPU?

Hi zhi_xz,

You can use the approaches below to get the response time and execution time.

Execution time: Enable the activity CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL.
Get the difference between the start and end timestamps of the kernel record.

Response time: You can use either of these two approaches:

  1. Enable the activities CUPTI_ACTIVITY_KIND_RUNTIME and CUPTI_ACTIVITY_KIND_DRIVER.

Get the difference between the start timestamp of the kernel record and the start timestamp of the API (runtime or driver) activity record.

Note that the ‘correlationId’ field should be used to correlate kernel and API activity records if there are multiple launches.

  2. Enable the callback CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel.

CUPTI will invoke this callback twice for the API, at the entry and exit points of the API.

The two invocations can be told apart by the ‘callbackSite’ field of the callback data (CUPTI_API_ENTER or CUPTI_API_EXIT).
Take the timestamp at the entry point. Use cuptiGetTimestamp to get the timestamp.

Get the difference between this timestamp and the start timestamp of the kernel activity record.

I would recommend using option 1.
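As a rough illustration of option 1, here is a minimal sketch of the matching step once the records have been collected. It does not call CUPTI itself: ApiRecord, KernelRecord, KernelTiming and computeTimings are hypothetical names, and the structs are simplified stand-ins that keep only the correlationId, start and end fields of the real API and kernel activity records. It also assumes that the runtime and driver records for one launch share a correlationId, so the earliest API start is kept per ID.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for CUPTI activity records (not the real structs).
// Only the fields needed for the calculation are kept; timestamps are in ns.
struct ApiRecord {      // models a runtime/driver API activity record
  uint32_t correlationId;
  uint64_t start;       // API entry on the CPU
  uint64_t end;         // API exit on the CPU
};
struct KernelRecord {   // models a kernel activity record
  uint32_t correlationId;
  uint64_t start;       // kernel begins executing on the GPU
  uint64_t end;         // kernel finishes on the GPU
};
struct KernelTiming {
  uint32_t correlationId;
  uint64_t responseNs;   // kernel start minus earliest API start
  uint64_t executionNs;  // kernel end minus kernel start
};

// Match kernel records to API records by correlationId.
// Assumption: if both runtime and driver records exist for one launch,
// the earliest start is the point where the launch began on the CPU.
std::vector<KernelTiming> computeTimings(const std::vector<ApiRecord>& apis,
                                         const std::vector<KernelRecord>& kernels) {
  std::unordered_map<uint32_t, uint64_t> apiStart;
  for (const ApiRecord& a : apis) {
    auto it = apiStart.find(a.correlationId);
    if (it == apiStart.end()) {
      apiStart.emplace(a.correlationId, a.start);
    } else if (a.start < it->second) {
      it->second = a.start;
    }
  }

  std::vector<KernelTiming> out;
  for (const KernelRecord& k : kernels) {
    auto it = apiStart.find(k.correlationId);
    // Skip kernels with no matching API record or inconsistent timestamps
    // (this also avoids unsigned underflow in the subtractions below).
    if (it == apiStart.end() || k.start < it->second || k.end < k.start) continue;
    out.push_back({k.correlationId, k.start - it->second, k.end - k.start});
  }
  return out;
}
```

In a real tool the two input vectors would be filled from the activity buffers, and the matching would run only after all buffers have been flushed, so that both records of a launch are available regardless of the order in which they were delivered.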

The kernel record only includes the execution time of the kernel on the GPU.
If you observed that the start timestamp of the kernel activity precedes that of cuLaunchKernel, the likely cause is one of the following:

  1. You used a different clock source to get the timestamp in the CUPTI callback, or
  2. You collected the timestamp at the exit point of the API.

Thank you for your response.

I apologize for my mistake in stating that the start timestamp of the CUpti_ActivityKernel4 activity was earlier than the start timestamp of cuLaunchKernel.

After further investigation, I noticed that the activity record callback always comes before the cudaLaunchKernel API callback. Is this normal in your experience?

Hi zhi_xz,

When you say activity record callback, do you mean the ‘funcBufferCompleted’ callback?

void GpuProfiler::handleActivityRecord(CUpti_Activity* record) {
  ICHECK_NOTNULL(record);
  switch (record->kind) {
    case CUPTI_ACTIVITY_KIND_KERNEL:
    case CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL: {
      // handle kernel
      break;
    }
    case CUPTI_ACTIVITY_KIND_RUNTIME:{
      // handle CUPTI_ACTIVITY_KIND_RUNTIME
      break;
    }
    case CUPTI_ACTIVITY_KIND_DRIVER: {
      // handle CUPTI_ACTIVITY_KIND_DRIVER
      break;
    }
    default:
      LOG_E("Unsupported activity record kind: %d", record->kind);
  }
}

The code above is the handler I call from the funcBufferCompleted callback. What I would like to highlight is that the kernel execution record (kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) always arrives before the CUPTI_ACTIVITY_KIND_DRIVER record for the cudaLaunchKernel API.
This seems highly unusual, since the kernel can only be executed through the cudaLaunchKernel API.

Hi zhi_xz,
CUPTI invokes the ‘funcBufferCompleted’ callback when the user buffer is completed.
CUPTI does not guarantee ordering of the activities in the buffer.
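
Since the order of records in the buffer carries no meaning, one option is to collect the records first and order them by timestamp afterwards. The following is only a sketch under that assumption: TimedRecord and chronological are hypothetical names, and the struct is a simplified stand-in for whatever fields the handler copies out of each CUPTI record.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the data copied out of each activity record.
struct TimedRecord {
  std::string label;       // e.g. the API or kernel name
  uint32_t correlationId;  // links an API record to the kernel it launched
  uint64_t start;          // start timestamp in ns
};

// Return the records ordered by start timestamp, independent of the
// order in which they were delivered in the activity buffers.
std::vector<TimedRecord> chronological(std::vector<TimedRecord> records) {
  std::stable_sort(records.begin(), records.end(),
                   [](const TimedRecord& a, const TimedRecord& b) {
                     return a.start < b.start;
                   });
  return records;
}
```

With records delivered as kernel, driver API, runtime API, sorting by start would place the runtime API first and the kernel last, provided all timestamps come from the same clock source.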
