DLProf with PyTorch's DistributedDataParallel


I’m starting to profile PyTorch DistributedDataParallel models with DLProf, and I’ve noticed that it takes forever to generate the SQLite database, which is huge: 22 GB. I’m using a ResNet18 model with the CIFAR10 dataset and only 3 epochs.

I wonder if it is a best practice to profile only one rank, say rank 0, and assume that the other ranks behave the same. Is that assumption accurate?
If not, is there any best-practices information or documentation on using DLProf with DistributedDataParallel (DDP) models?


Please note that DLProf was sunset many months ago and is no longer supported. Please use Nsight Systems or the native PyTorch profiler.
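
For the native PyTorch profiler route, here is a minimal sketch of profiling a single rank for only a few steps, which should keep the trace far smaller than a full 3-epoch capture. It assumes the job is launched with torchrun (which sets the RANK environment variable) and that rank 0 is representative of the others; that second assumption may not hold if one rank does extra work such as logging or checkpointing. The helper names (`current_rank`, `maybe_profiler`, `train_one_epoch`), the `step_fn` callback, the trace directory, and the schedule values are all hypothetical choices for illustration.

```python
import contextlib
import os


def current_rank(env=os.environ):
    """Read the process rank from RANK, which torchrun sets for each worker."""
    return int(env.get("RANK", "0"))


def maybe_profiler(rank, profile_rank=0, trace_dir="./profiler_traces"):
    """Return a torch profiler on one rank and a no-op context on the others."""
    if rank != profile_rank:
        return contextlib.nullcontext()  # yields None inside the with-block

    # Imported lazily so non-profiled ranks never touch the profiler.
    import torch
    from torch.profiler import (
        ProfilerActivity,
        profile,
        schedule,
        tensorboard_trace_handler,
    )

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    return profile(
        activities=activities,
        # Skip 1 step, warm up for 1, record 3, then stop (repeat=1).
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler(trace_dir),
    )


def train_one_epoch(step_fn, num_steps, rank):
    """Run num_steps steps; step_fn is a placeholder for one training iteration."""
    with maybe_profiler(rank) as prof:
        for step in range(num_steps):
            step_fn(step)
            if prof is not None:
                prof.step()  # advance the profiler schedule
```

In a DDP script this would be called as `train_one_epoch(step_fn, len(loader), current_rank())`, with `step_fn` wrapping the forward pass, backward pass, and optimizer step. Changing `profile_rank` for a second short run is one way to check whether another rank really looks the same as rank 0.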

Too bad… I’ve used Nsight Systems and found it a little convoluted to get insights from. DLProf could at least summarize a lot of information. I also found it more useful than the PyTorch profiler.

Thanks for your reply!