Peculiar time gaps in CNN training

In Nsight Systems profiles of PyTorch CNN training, I can see strange gaps after roughly every three training iterations.

The screenshot above shows a few iterations of PyTorch GoogLeNet training. Each iteration is marked with an NVTX region. You can see the gaps are after the 3rd, the 6th, and the 9th iterations.
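For reference, here is a minimal sketch of how each iteration can be wrapped in an NVTX region so it shows up as a labeled range on the Nsight Systems timeline. The train function, model, loader, and other names are placeholders, not the actual training script:

```python
import torch
import torch.cuda.nvtx as nvtx

def train(model, loader, optimizer, criterion, device):
    model.train()
    for step, (images, labels) in enumerate(loader):
        # One NVTX range per training step; nsys shows these as
        # named regions on the timeline.
        nvtx.range_push(f"iteration {step}")

        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

        nvtx.range_pop()
```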

The only activity captured by the nsys profiler in these gaps is OS runtime “sem_timedwait” calls.

The training ran on an AWS p3.2xlarge instance inside a Docker container.

OS: Ubuntu 18.04.3 LTS,
NVIDIA driver: 430.64,
CUDA: 10.1,
cuDNN: 7.6.5.32-1.

Any ideas about what is going on during these gaps and what causes them?

Moved this post under the tool “Nsight Systems”

I measured the time spent reading training samples (ImageNet), and it looks like reading the samples from disk is the cause of the gaps.

I used three reader processes here (the num_workers parameter of the PyTorch DataLoader class).
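Roughly how this can be measured: time how long the training loop waits for the next mini-batch from the DataLoader. The ImageNet path, transforms, and batch size below are placeholders, not the exact setup used here:

```python
import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder ImageNet location and a typical training transform.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("/data/imagenet/train", transform=transform)

# num_workers=3 matches the three reader processes mentioned above.
loader = DataLoader(train_set, batch_size=32, shuffle=True,
                    num_workers=3, pin_memory=True)

it = iter(loader)
for step in range(20):
    t0 = time.perf_counter()
    images, labels = next(it)   # blocks while workers read/decode from disk
    dt = (time.perf_counter() - t0) * 1000
    print(f"step {step}: waited {dt:.1f} ms for the next mini-batch")
```

When the workers cannot keep up with the GPU, the wait times spike periodically, which matches the gaps seen in the profile.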

I observed similar read-induced delays at mini-batch sizes of around 30 and larger. So you need a really fast SSD to efficiently train GoogLeNet, or other CNNs on large training sets like ImageNet, on a V100 GPU.
