I am profiling a DLRM model during training with Nsight Systems, and I noticed a very large time interval between two consecutive iterations.
In the timeline, a pthread_cond_wait call appears to account for this delay of about 15 s. However, when I run training without the profiler, there is no such delay and the iterations complete at the expected rate.
By default, Nsight Systems traces CUDA, OpenGL, NVTX, and OS runtime library APIs, and collects CPU sampling and thread scheduling information. I suspect that the OS runtime trace may be the source of the problem.
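
One way I could test this suspicion is to rerun the profile with OS runtime tracing left out and check whether the gap goes away. The following is only a minimal sketch of how that command could be assembled: it uses the `--trace` and `--output` options of `nsys profile`, while the script name `train_dlrm.py`, the report name `dlrm_no_osrt`, and the helper `build_nsys_command` are hypothetical placeholders rather than anything from my actual setup.

```python
import shlex

# Default trace domains named in the post: CUDA, OpenGL, NVTX, OS runtime.
DEFAULT_TRACES = ["cuda", "opengl", "nvtx", "osrt"]


def build_nsys_command(target, drop=("osrt",), output="dlrm_no_osrt"):
    """Build an `nsys profile` command line that omits the given trace domains.

    `target` is the launch command of the workload as a list of strings.
    This helper is hypothetical; it only assembles arguments and runs nothing.
    """
    traces = [t for t in DEFAULT_TRACES if t not in drop]
    return [
        "nsys", "profile",
        "--trace=" + ",".join(traces),   # explicit trace list without osrt
        "--output=" + output,            # assumed report name
        *target,
    ]


# Hypothetical training entry point; replace with the real launch command.
cmd = build_nsys_command(["python", "train_dlrm.py"])
print(shlex.join(cmd))
```

If the 15 s gap disappears with this trace list, that would point at the OS runtime trace; if it remains, the same helper can drop other domains one at a time to narrow it down.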