The NVTX traces for CUDA HW and Threads end at different times while tracing the same function call

As you can see in the screenshot below, there is a time gap between the ends of the batch-forward range recorded under CUDA HW and under Threads.

Is the projection from CPU to GPU wrong?
Any clues would be appreciated, thanks in advance!

This is exactly the screen I would expect to see.

The Batch-Forward (the CUDA API) is called on the CPU, and the work is triggered on the GPU (the CUDA hardware). The GPU performs the work, and the CPU has to wait for the GPU to finish before it can move on (thus the cudaStreamSynchronize, which is why there is the pthread_cond_wait).
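To make the timing concrete, here is a toy model of that sequence. It is not real CUDA or NVTX code: a plain Python thread stands in for the GPU, a join stands in for cudaStreamSynchronize, and the function name run_batch_forward and the timestamp names are made up for illustration. It only shows why the CPU-side range and the GPU-side work do not have to end at the same instant.

```python
import threading
import time


def run_batch_forward(gpu_seconds=0.05):
    """Simulate one traced call; return timestamps in seconds from range start."""
    marks = {}
    t0 = time.perf_counter()  # start of the (hypothetical) CPU-side range

    def gpu_work():
        # Stand-in for the kernels executing on the device.
        time.sleep(gpu_seconds)
        marks["gpu_end"] = time.perf_counter() - t0

    # "Launch": hand the work to the device; the call returns right away.
    device = threading.Thread(target=gpu_work)
    device.start()
    marks["launch_returned"] = time.perf_counter() - t0

    # Stand-in for cudaStreamSynchronize: the CPU thread blocks here
    # until the device has finished.
    device.join()
    marks["cpu_range_end"] = time.perf_counter() - t0
    return marks


if __name__ == "__main__":
    # Print the three events in the order they happened.
    for name, seconds in sorted(run_batch_forward().items(), key=lambda kv: kv[1]):
        print(f"{name:16s} {seconds * 1000:8.3f} ms")
```

In this model the launch returns almost immediately, the simulated GPU work ends later, and the CPU range closes only after the wait returns, so the two rows end at slightly different times even though they describe the same call.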

The correlation shows that the work call on the CPU is what leads to that work on the GPU.

If I were working on this, I would determine whether that synchronization is needed.