DLA and GPU cores at the same time

Hello AastaLLL,

I would like to ask a question based on the code you provided.

I profiled your code and saw the result shown in the image:

Based on the image:
After the inference call (ExecutionContext::enqueue), there are calls to cudaEventRecord and cudaEventSynchronize.
What counts as the total execution time of inference? Does it include the time spent in cudaEventRecord and cudaEventSynchronize, or only the time of ExecutionContext::enqueue?

Thank you.