Profiling DLRM ML training using Nsight Systems

I am profiling a DLRM model during training and I noticed a very large time interval between two consecutive iterations.

I can see that a pthread_cond_wait call is causing this 15 s delay. However, when running without the profiler, I don't see any such delay and the iterations complete at the expected rate.

Any pointers to explain this would be great.

Thanks!

Hmm. I certainly would not expect to see a 15 second delay here.

Can you tell me the command line you are running with?

Sure, the following is my command:

/tmp/nsight-systems-2023.3.1/bin/nsys profile --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py --data-generation=$DATA_GEN --round-targets=True --learning-rate=1.0 --arch-mlp-bot=$BOT_MLP --arch-mlp-top=$TOP_MLP --arch-sparse-feature-size=$EMB_DIM --max-ind-range=40000000 --numpy-rand-seed=727 --num-batches=$NUM_BATCH --data-size 100000000 --num-indices-per-lookup=$EMB_LS --num-indices-per-lookup-fixed=True --arch-embedding-size=$EMB_TBL --print-freq=1 --print-time --mini-batch-size=$BS $EXTRA_FLAGS --use-gpu --break-point=3

I am not sure if you have reproducing code you can share, so I am going to ask you to do some investigation.

Can you try minimizing the number of things that you are collecting?

Try:

nsys profile --trace=cuda,nvtx --sample=none --stats=true --force-overwrite=true -o
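Applied to the invocation you posted, that would look something like this (same path, script, and output name as before; the DLRM arguments are abbreviated here):

/tmp/nsight-systems-2023.3.1/bin/nsys profile --trace=cuda,nvtx --sample=none --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py <same DLRM arguments as above>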

Nsight Systems' default is to trace CUDA, OpenGL, NVTX, and OS runtime library APIs, and to collect CPU sampling and thread scheduling information. I suspect that the issue may be in the OS runtime trace.
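If the long gap goes away with that reduced set, one way to confirm that the OS runtime trace is what introduces it is to add only that trace back in (osrt) and compare the two reports, for example (the output name below is just an example so the earlier report is not overwritten):

/tmp/nsight-systems-2023.3.1/bin/nsys profile --trace=cuda,nvtx,osrt --sample=none --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn_osrt python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py <same DLRM arguments as above>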