Negative latencies

image
Hi,

I’m using CUPTI to profile the latency from the CPU issuing a kernel to the kernel’s execution. I get a timestamp at the entry of CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 by calling cuptiGetTimestamp. I then use the start timestamp of CUpti_ActivityKernel9 (from CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) and subtract the timestamp I got from the callback function. To ensure everything is consistent, I also retrieve the start timestamp of CUpti_ActivityAPI from CUPTI_ACTIVITY_KIND_RUNTIME, using correlationId to match them. The attached image shows the results.

For correlationId 23679 (not the cbid in the image), the kernel start time from CUpti_ActivityKernel9 is 1732487239483534075. The timestamp from the entry of the callback function is 1732487239483576598. The start time of CUpti_ActivityAPI is 1732487239483561019. The start timestamp of the kernel is lower than the other two timestamps. However, it seems impossible for kernel execution to begin before its corresponding runtime API call. Why does this happen? This issue occurs frequently in our application.

I’m using CUDA Toolkit 12.4. The application is executed using the torchrun command, which involves multi-threading.

Hi, @frankchen8508

Thanks for reporting this issue. This seems like a issue we recently fixed. Can you please use the latest CUDA12.6 Update 2 toolkit to have a try ? Thanks !