Hi,
I’m using CUPTI to profile the latency from the CPU issuing a kernel to the kernel’s execution. I get a timestamp at the entry of CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000
by calling cuptiGetTimestamp
. I then use the start timestamp of CUpti_ActivityKernel9
(from CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL
) and subtract the timestamp I got from the callback function. To ensure everything is consistent, I also retrieve the start timestamp of CUpti_ActivityAPI
from CUPTI_ACTIVITY_KIND_RUNTIME
, using correlationId to match them. The attached image shows the results.
For correlationId
23679 (not the cbid
in the image), the kernel start time from CUpti_ActivityKernel9
is 1732487239483534075. The timestamp from the entry of the callback function is 1732487239483576598. The start time of CUpti_ActivityAPI
is 1732487239483561019. The start timestamp of the kernel is lower than the other two timestamps. However, it seems impossible for kernel execution to begin before its corresponding runtime API call. Why does this happen? This issue occurs frequently in our application.
I’m using CUDA Toolkit 12.4. The application is executed using the torchrun command, which involves multi-threading.