Profiling DLRM ML training using Nsight Systems

I am profiling a DLRM model during training and I noticed a very large time interval between two consecutive iterations.

I can see that a pthread_cond_wait call is causing this 15 s delay. However, when running without the profiler, I don't see any such delay and the iterations complete at the expected rate.

Any pointers to explain this would be great.

Thanks!

Hmm. I certainly would not expect to see a 15 second delay here.

Can you tell me the command line you are running with?

Sure, the following is my command:

/tmp/nsight-systems-2023.3.1/bin/nsys profile --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py --data-generation=$DATA_GEN --round-targets=True --learning-rate=1.0 --arch-mlp-bot=$BOT_MLP --arch-mlp-top=$TOP_MLP --arch-sparse-feature-size=$EMB_DIM --max-ind-range=40000000 --numpy-rand-seed=727 --num-batches=$NUM_BATCH --data-size 100000000 --num-indices-per-lookup=$EMB_LS --num-indices-per-lookup-fixed=True --arch-embedding-size=$EMB_TBL --print-freq=1 --print-time --mini-batch-size=$BS $EXTRA_FLAGS --use-gpu --break-point=3

I am not sure if you have reproducing code you can share, so I am going to ask you to do some investigation.

Can you try minimizing the number of things that you are collecting?

Try:

nsys profile --trace=cuda,nvtx --sample=none --stats=true --force-overwrite=true -o
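Applied to the invocation you posted, that would look something like this (same path, script, and output name as before; the DLRM arguments are abbreviated here):

/tmp/nsight-systems-2023.3.1/bin/nsys profile --trace=cuda,nvtx --sample=none --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py <same DLRM arguments as above>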

Nsight Systems' default is to trace CUDA, OpenGL, NVTX, and OS runtime library APIs, and to collect CPU sampling and thread scheduling information. I suspect that the issue may be in the OS runtime trace.
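If the long gap goes away with that reduced set, one way to confirm that the OS runtime trace is what introduces it is to add only that trace back in (osrt) and compare the two reports, for example (the output name below is just an example so the earlier report is not overwritten):

/tmp/nsight-systems-2023.3.1/bin/nsys profile --trace=cuda,nvtx,osrt --sample=none --stats=true --force-overwrite=true -o gpu01_dlrm_uvm_syn_osrt python3.7 $syn_dlrmpath/dlrm_s_uvm_pytorch.py <same DLRM arguments as above>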