Behaviour of pthread_mutex_lock

Hi,

I profiled my application with Nsight Systems on a Jetson TX2 target and observed some unexpected behaviour. I would appreciate some insight from NVIDIA on it.

Timeline view of my application:
[https://drive.google.com/file/d/1d7oYR4zkDGJ6kSbv66g6PiYHJ06_OQaV/view?usp=sharing]

Application background & problem:
I’m running two main threads concurrently, 4693 and 4694 (both can be seen in the timeline view).
I find that kernel launches issued from thread 4693 are blocked whenever a CUDA memory or synchronization API call is issued concurrently from thread 4694.
As a result, a “pthread_mutex_lock” call appears in the timeline and the GPU sits idle for some time. (Note: concurrent copy and execute is not yet implemented.)

However, I also observe mutex locks between two independent kernels.
In the timeline view, the kernels “copyPackedKernel” and “PyrDown” block one another even though there is no data dependency between them and no memory-related operation is visible on the GPU. Why is that? This behaviour leaves the GPU idle for a long time.

I would like to understand this behaviour clearly so that I can optimize my application. Thank you for your time!
nvidia.png

Hi Viknesh,

I believe what you are seeing here is contention for the context lock in the CUDA driver, but I can’t be 100% sure of that from just a screenshot. One particular question I have is about the range from +455ms to +465ms: are there any other threads where you see blockage on an ioctl? And how many threads in total are doing CUDA work here? Would you mind sharing the .qdrep file as well? If you are not comfortable posting a link to it in a public forum, I can give you direct contact info. Having the full .qdrep file would help me consult with the CUDA driver team and get a more complete picture of what is going on.
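
As a rough illustration of this kind of contention, here is a minimal host-only C++ sketch. It does not call CUDA and it is not how the driver is implemented: `driver_lock`, `api_call`, and `contended_wait_ms` are hypothetical names, and the single shared lock is only an assumption standing in for the context lock. It shows how a short call from one thread can spend most of its time waiting on a mutex while another thread holds that mutex during a long call. On Linux, `std::mutex` is typically built on `pthread_mutex_lock`, which is the symbol that shows up in the timeline.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::duration<double, std::milli>;

// Hypothetical stand-in for one lock shared by every API call in the process.
std::mutex driver_lock;

// Pretend API call: acquire the shared lock, hold it for hold_ms, and return
// how long the caller waited to acquire the lock (in milliseconds).
// If `acquired` is given, it is signalled once the lock is held.
double api_call(int hold_ms, std::promise<void>* acquired = nullptr) {
    auto start = Clock::now();
    std::lock_guard<std::mutex> guard(driver_lock);
    double waited_ms = Ms(Clock::now() - start).count();
    if (acquired) acquired->set_value();  // tell the other thread we hold the lock
    std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
    return waited_ms;
}

// Thread A makes one long call (think: a synchronization call) while the
// calling thread attempts a short call (think: a kernel launch).
// Returns how long the short call waited for the lock.
double contended_wait_ms(int long_hold_ms) {
    std::promise<void> acquired;
    std::future<void> ready = acquired.get_future();
    std::thread a([&] { api_call(long_hold_ms, &acquired); });
    ready.wait();                    // make sure A already holds the lock
    double waited_ms = api_call(1);  // short call now blocks behind A
    a.join();
    return waited_ms;
}
```

Calling `contended_wait_ms(100)` from a `main` function should return a wait close to the 100 ms hold time, whereas an uncontended `api_call(1)` returns a wait near zero. In a profile, that wait is what appears as time spent inside `pthread_mutex_lock`.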

Thanks!

  • Jason (Nsight Systems)

Hi Jason,

Thank you for the reply. I would be grateful if you could share your contact info. I will send the .qdrep file.

To answer your questions, I observe a total of 36 threads, of which two do the CUDA work. And I do observe an ioctl blockage in thread ID 4707 from +456.15ms to +462.7ms.

I think that if contention were the cause, similar behaviour would be observed whenever two other kernels are mutex locked. But that is not observed frequently in the log.

Looking forward to your contact information to continue the discussion further. Nice day!

Regards,
Viknesh

Hey Jason,

I was wondering whether the behaviour is due to implicit synchronization (a kernel waiting for a memory operation to complete) in one of the two streams. Could the same be the reason for the prolonged scheduling of kernels in the report?

Regards,
Viknesh