In the yellow box that appears when hovering over a kernel call in stream 15, end - start is 46.531 microseconds, but the latency is reported as 7.347 microseconds. Why are they not the same? What is end - start capturing that latency is not? I also noticed that the latency in the yellow box matches the one shown for the corresponding launch kernel call in the CUDA API events log in NSys.
Thanks
Indeed a well-written piece. Thanks for the article. It clears up my doubt. So latency is the time between when the launch API call was enqueued and when the GPU started executing the kernel. And the duration in the CUDA API trace is the CPU wrapper overhead.
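To make sure I have the arithmetic right, here is a minimal sketch in Python. Only the two differences (46.531 and 7.347 microseconds) come from the tooltip; the absolute timestamp `api_launch_us` is made up, and I am assuming that end - start is the kernel's own run time on the GPU and that latency is measured from the launch API call to the kernel's start.

```python
# Hypothetical timeline in microseconds. Only the two differences
# (7.347 and 46.531) are taken from the tooltip; the base timestamp is invented.
api_launch_us = 1000.000                   # assumed: launch API call issued on the CPU
kernel_start_us = api_launch_us + 7.347    # kernel begins executing on the GPU
kernel_end_us = kernel_start_us + 46.531   # kernel finishes on the GPU

# Latency: gap between the launch call and the kernel starting (assumption above).
latency_us = round(kernel_start_us - api_launch_us, 3)

# end - start: how long the kernel itself ran on the GPU.
gpu_duration_us = round(kernel_end_us - kernel_start_us, 3)

print(f"latency      = {latency_us} us")
print(f"end - start  = {gpu_duration_us} us")
```

Under these assumptions the two numbers describe different, non-overlapping intervals, so there is no reason for them to be equal.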