The screenshot above shows a few iterations of PyTorch GoogLeNet training. Each iteration is marked with an NVTX range. You can see the gaps after the 3rd, the 6th, and the 9th iterations.
The only activity captured by the nsys profiler during these gaps is OS runtime "sem_timedwait" calls.
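For reference, a minimal sketch of how per-iteration NVTX regions can be added in PyTorch is below; the model, loss, optimizer, and loader names are placeholders for the actual GoogLeNet training objects, not the exact code used in this run.

```python
import torch
import torch.cuda.nvtx as nvtx

def train_epoch(model, criterion, optimizer, train_loader, device):
    """Wrap each training iteration in an NVTX range so it shows up
    as a named region on the nsys timeline."""
    model.train()
    for i, (images, targets) in enumerate(train_loader):
        nvtx.range_push(f"iteration_{i}")   # open the per-iteration NVTX region

        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

        nvtx.range_pop()                    # close the per-iteration NVTX region
```

The profile itself can be collected with something along the lines of `nsys profile --trace=cuda,nvtx,osrt -o report python train.py`; the osrt tracer is what records the "sem_timedwait" calls seen in the gaps.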
The training ran on a p3.2xlarge instance in the AWS cloud, inside a Docker container:
OS: Ubuntu 18.04.3 LTS
NVIDIA driver: 430.64
CUDA: 10.1
cuDNN: 7.6.5.32-1
Any ideas about what is going on during these gaps and what causes them?
I measured the time spent reading training samples (ImageNet), and it looks like reading the samples from disk is the cause of the gaps.
I used three reader processes here (the num_workers parameter of the PyTorch DataLoader class).
I observed similar read-induced delays with mini-batch sizes of around 30 and larger. So you need a really fast SSD to efficiently train GoogLeNet or other CNNs on a large dataset like ImageNet on a V100 GPU.
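As a rough way to confirm this, one can time the DataLoader iterator separately from the GPU work. The sketch below is an assumption about the setup, not the exact measurement code used here: the ImageNet path, transform, and batch size are placeholders, while num_workers=3 matches the three reader processes mentioned above.

```python
import time

import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Placeholder ImageNet pipeline; the path, transform, and batch size are assumptions.
transform = T.Compose([T.RandomResizedCrop(224), T.ToTensor()])
dataset = torchvision.datasets.ImageFolder("/data/imagenet/train", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=3)

it = iter(loader)
for step in range(10):
    t0 = time.perf_counter()
    images, targets = next(it)          # blocks while the worker processes read/decode samples
    read_time = time.perf_counter() - t0

    # ... forward/backward pass on the GPU would go here ...
    print(f"step {step}: data loading took {read_time * 1000:.1f} ms")
```

If the printed data-loading time dominates the iteration time every few steps, that is consistent with the gaps appearing roughly every third iteration here: once the prefetched batches from the three workers are consumed, the loop stalls until the workers finish reading the next samples from disk.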