I have profiled my application with Nsight Systems on a Jetson TX2 target and I observe behaviour that I cannot explain. I would like some insight from NVIDIA on this.
Timeline view of my application:
Application background & problem:
I’m concurrently executing two major threads, namely 4693 and 4694 (both can be seen in the timeline view).
I find that kernel executions scheduled in thread 4693 are blocked whenever a CUDA memory or synchronization API call is issued concurrently in thread 4694.
As a result, a “pthread_mutex_lock” call shows up in the timeline and the GPU sits idle for a certain time. (Note: concurrent copy and execute is not yet implemented.)
However, I also observe mutex locks between two independent kernels.
In the timeline view, the kernels “copyPackedKernel” and “PyrDown” block one another even though there is no data dependency between them and no memory-related operation is visible on the GPU. Why is that? This behaviour causes a large amount of GPU idle time.
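For reference, here is a minimal sketch of the launch pattern described above. It is not the actual application code: the kernel bodies, buffer sizes, iteration count, and the use of one non-blocking stream per host thread are all assumptions made for illustration, and only the kernel names are taken from the timeline.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Placeholder kernels: names mirror the timeline, bodies are hypothetical.
__global__ void copyPackedKernel(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

__global__ void PyrDown(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Average neighbouring pairs; n is assumed to be even.
    if (i < n / 2) dst[i] = 0.5f * (src[2 * i] + src[2 * i + 1]);
}

// Each host thread owns its buffers and its own non-blocking stream
// (an assumption), so the two threads share no data and do not rely
// on the default stream.
void worker(bool usePyrDown, int n, int iterations) {
    float* src = nullptr;
    float* dst = nullptr;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemsetAsync(src, 0, n * sizeof(float), stream);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iterations; ++it) {
        if (usePyrDown) {
            PyrDown<<<blocks, threads, 0, stream>>>(dst, src, n);
        } else {
            copyPackedKernel<<<blocks, threads, 0, stream>>>(dst, src, n);
        }
        // Per-iteration synchronization API call issued from this thread.
        cudaStreamSynchronize(stream);
    }

    cudaFree(src);
    cudaFree(dst);
    cudaStreamDestroy(stream);
}

int main() {
    const int n = 1 << 20;        // hypothetical buffer size (even)
    const int iterations = 100;   // hypothetical iteration count

    // Two host threads, analogous to threads 4693 and 4694 in the timeline.
    std::thread a(worker, false, n, iterations);
    std::thread b(worker, true, n, iterations);
    a.join();
    b.join();

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Profiling a reduced case like this with Nsight Systems may help show whether the blocking comes from the launch pattern itself or from something specific to the application.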
I would like to get a clear understanding of this behaviour so that I can optimize my application. Thank you for your time!