I expected the performance to be better than the results at https://www.tensorflow.org/guide/performance/benchmarks.
However, the results are considerably worse than those of the original TF.
I use the same data (ILSVRC2012) and batch size.
According to the official benchmark results, the performance of inception_v3 on P100x4 is about 569 images/sec; here I got about 500.
According to the official benchmark results, the performance of alexnet is about 10509 images/sec on P100x4 and 4448 images/sec on P100x2; here I got only about 1600 images/sec.
I also ran the same model, resnet50 v1.5, with Release 19.05 TF and PyTorch.
The results show there is a scalability issue in TF.
TF on 4xP100 reaches 613 images/sec, whereas PyTorch reaches 857 images/sec.
These numbers are a little better than what you report, but that may well be caused by differing kernel/driver versions, particularly whether your kernel includes Spectre/Meltdown patches (my system has not been updated with those).
For AlexNet in particular, the IO pipeline can be a serious bottleneck. The 10000 img/sec result on the benchmarking page you note is with synthetic data and a batch size of 512 images/GPU; real data was reported much lower, at 7100 img/sec. In the 19.05 container, if I run with a data set whose input images have been pre-resized to 480px on the shortest side and boost the batch size to 256 images/GPU, I can actually achieve performance similar to the synthetic-data results.
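For reference, the pre-resize step I mean is roughly the following sketch (using Pillow; the paths are placeholders and this is not the exact script behind the numbers above):

```python
# Sketch of the pre-resize step: shrink each image so its shortest side is
# 480px before building the TFRecords, which cuts CPU decode/resize cost.
from PIL import Image

def resize_shortest_side(src_path, dst_path, target=480):
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    scale = target / float(min(w, h))
    if scale < 1.0:  # only shrink; leave smaller images untouched
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    img.save(dst_path, format="JPEG", quality=90)

# e.g. resize_shortest_side("raw/n01440764/img.JPEG", "resized/n01440764/img.JPEG")
```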
Hi nluehr,
I have updated the alexnet results here. Are my results similar to yours without pre-resizing?
Also, do you think the experimental setup described at the link is unclear?
My Linux kernel is 3.10.0-862.14.4.el7.x86_64 and the driver is 410.48.
Can this combination of kernel and driver achieve the ideal performance?
At least for CentOS, kernels after 3.10.0-693.11.6.el7.x86_64 include Spectre/Meltdown mitigations. These likely explain the lower perf you are seeing. With 4 GPUs the CPU bottleneck is most severe, and the CPU perf loss due to kernel patches would be most noticeable.
Given that you are CPU limited, you could consider using DALI to move much of the input pipeline to the GPU. You will need to create index files for your tfrecord inputs (see https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/dataloading_tfrecord.html; a minimal sketch of that step is included at the end of this post). Then just add --use_dali --data_idx_dir=/path/to/dali_index_files to your commands above. Here is what I see with DALI enabled.
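For the index-file step, here is a minimal sketch; the directory names and shard glob pattern are placeholders, and it assumes the tfrecord2idx helper that ships with DALI is on your PATH, as in the linked example:

```python
# Generate a DALI index file for each TFRecord shard (sketch; adjust the
# placeholder paths and the shard pattern to your dataset layout).
import os
from glob import glob
from subprocess import check_call

tfrecord_dir = "/data/imagenet/tfrecords"    # where the train-* shards live (placeholder)
idx_dir = "/data/imagenet/dali_index_files"  # pass this via --data_idx_dir

os.makedirs(idx_dir, exist_ok=True)
for rec in sorted(glob(os.path.join(tfrecord_dir, "train-*"))):
    idx = os.path.join(idx_dir, os.path.basename(rec) + ".idx")
    if not os.path.isfile(idx):
        check_call(["tfrecord2idx", rec, idx])
```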