I’m starting to profile PyTorch DistributedDataParallel (DDP) models with DLProf, and I’ve noticed that generating the SQLite database takes a very long time and the resulting file is huge: 22 GB. I’m using a ResNet18 model with the CIFAR10 dataset and only 3 epochs.
I wonder if it is a best practice to profile only one rank, say rank 0, and assume that other ranks would behave the same. Is that assumption accurate?
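To make the idea concrete, here is a minimal sketch of the gating I have in mind. It assumes the job is launched with `torchrun`, which sets the `RANK` environment variable for each process. The helper names (`current_rank`, `should_profile`) and the step window are hypothetical and are not part of DLProf or PyTorch. Whether DLProf can restrict its capture to such a gate is exactly what I am unsure about.

```python
import os


def current_rank(env=None):
    """Return this process's global rank; fall back to 0 if RANK is unset."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def should_profile(rank, step, profiled_rank=0, start_step=10, stop_step=20):
    """True only on the chosen rank and inside [start_step, stop_step)."""
    return rank == profiled_rank and start_step <= step < stop_step


if __name__ == "__main__":
    rank = current_rank()
    for step in range(30):
        if should_profile(rank, step):
            # Placeholder: profiling would be active only for these steps
            # on this rank; the real training step would run here.
            print(f"rank {rank}: profiling step {step}")
```

Limiting the capture to a short window of steps on a single rank is also what I would expect to shrink the output file, but that is a guess on my part.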
If not, is there any best-practices documentation for using DLProf with DistributedDataParallel (DDP) models?