Is there a performance issue in Release 19.05?

Hi all,
My platform is an 8xP100 system with NVLink.

GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  CPU Affinity
GPU0     X      NV1     NV1     NV1     NV1     PHB     PHB     PHB     PHB     0-47
GPU1    NV1      X      NV1     NV1     PHB     NV1     PHB     PHB     PHB     0-47
GPU2    NV1     NV1      X      NV1     PHB     PHB     NV1     PHB     PHB     0-47
GPU3    NV1     NV1     NV1      X      PHB     PHB     PHB     NV1     PHB     0-47
GPU4    NV1     PHB     PHB     PHB      X      NV1     NV1     NV1     PHB     0-47
GPU5    PHB     NV1     PHB     PHB     NV1      X      NV1     NV1     PHB     0-47
GPU6    PHB     PHB     NV1     PHB     NV1     NV1      X      NV1     PHB     0-47
GPU7    PHB     PHB     PHB     NV1     NV1     NV1     NV1      X      PHB     0-47
mlx5_0  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X
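
(For anyone who wants to compare against their own node, a matrix like this one can be printed with the command below; NV1 means the connection traverses a single NVLink, and PHB means it crosses a PCIe host bridge.)

# show GPU/NIC interconnect topology
nvidia-smi topo -m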

I expected the performance to be better than the results at https://www.tensorflow.org/guide/performance/benchmarks.
However, my results are much worse than those reported for stock TF.
I use the same data (ILSVRC2012) and batch size.

According to the official benchmark results, inception_v3 reaches about 569 images/sec on 4xP100. Here I got about 500.

root@67293c7d5e95:/workspace/nvidia-examples/cnn# mpiexec --allow-run-as-root --bind-to socket -np 4 -x CUDA_VISIBLE_DEVICES=4,5,6,7 python inception_v3.py --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012 -u batch -i 1000 -b 64 --display_every 50

  Step Epoch Img/sec   Loss  LR
     1   0.0    24.6  7.034  8.146 1.00000
    50   0.0   398.0  6.847  7.942 0.99978
   100   0.0   500.6  6.685  7.699 0.99956
   150   0.0   501.4  6.756  7.690 0.99934
   200   0.0   500.5  6.544  7.404 0.99912
   250   0.0   500.6  6.253  7.046 0.99889
   300   0.1   499.1  6.507  7.241 0.99867
   350   0.1   500.3  6.355  7.035 0.99845
   400   0.1   497.2  6.032  6.666 0.99823

According to the official benchmark results, AlexNet reaches about 10509 images/sec on 4xP100 and 4448 images/sec on 2xP100. Here I got only about 1600 images/sec on 2xP100.

root@5ae653a38f76:/workspace/nvidia-examples/cnn# mpiexec --allow-run-as-root --bind-to socket -np 2 -x CUDA_VISIBLE_DEVICES=4,5 python alexnet.py --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012 -u batch -i 10000 -b 64 --display_every 100

  Step Epoch Img/sec   Loss  LR
     1   0.0    45.2  6.931 10.005 1.00000
   100   0.0  1579.9  6.894  8.366 0.99978
   200   0.0  1701.3  6.900  7.463 0.99956
   300   0.0  1860.8  6.915  7.145 0.99934
   400   0.0  1855.6  6.909  7.010 0.99911
   500   0.0  1848.7  6.902  6.950 0.99889
   600   0.1  1622.4  6.908  6.934 0.99867
   700   0.1  1915.6  6.909  6.925 0.99845
   800   0.1  1756.3  6.902  6.912 0.99823
   900   0.1  1576.8  6.912  6.920 0.99800
  1000   0.1  1671.1  6.917  6.922 0.99778
  1100   0.1  1881.2  6.910  6.915 0.99756
  1200   0.1  1532.3  6.904  6.908 0.99734
  1300   0.1  1772.3  6.911  6.914 0.99712
  1400   0.1  1856.2  6.910  6.913 0.99690
  1500   0.1  1815.5  6.896  6.898 0.99667
  1600   0.2  1574.7  6.909  6.911 0.99645
  1700   0.2  1589.9  6.902  6.904 0.99623
  1800   0.2  1778.7  6.904  6.906 0.99601

I also ran the same model, resnet50v1.5, with the Release 19.05 TF and PyTorch containers.
The results suggest a scalability issue in TF: on 4xP100, TF reaches 613 images/sec while PyTorch reaches 857 img/sec.

Running your commands on a P100 DGX-1, I see roughly the following.

Inception_v3 4xP100 bs=64 full-sized images: ~570 img/sec
AlexNet 2xP100 bs=64 full-sized images: ~2300 img/sec
ResnetV1.5 4xP100 bs=64 full-sized images: ~720 img/sec

These numbers are a little better than what you report, but that may well be caused by differing kernel/driver versions, particularly whether your kernel includes Spectre/Meltdown patches (my system has not been updated with those).

For AlexNet in particular, the I/O pipeline can be a serious bottleneck. The ~10,500 img/sec result on the benchmark page you cite was measured with synthetic data and a batch size of 512 images/GPU; the real-data number was reported much lower, at 7100 img/sec. In the 19.05 container, if I run with a data set whose input images have been pre-resized to 480px on the shortest side and boost the batch size to 256 images/GPU, I can actually achieve performance similar to the synthetic-data results.

AlexNet 4xP100 bs=256 480px-short-side: ~10200 img/sec
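
For reference, the pre-resizing I mentioned can be done with something like the sketch below before rebuilding your TFRecords. This is only one way to do it; I'm assuming the raw JPEGs are available in a class-per-directory layout and that ImageMagick is installed, and the paths are placeholders.

# Resize every JPEG so its shortest side is 480px. ImageMagick's "^" flag makes
# 480x480 the minimum bounding box while preserving aspect ratio.
cd /path/to/raw_jpegs                      # placeholder: original ILSVRC2012 JPEGs
for f in */*.JPEG; do
  mkdir -p "/path/to/resized_480/$(dirname "$f")"
  convert "$f" -resize '480x480^' "/path/to/resized_480/$f"
done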

If you are not able to reproduce, please post your kernel and NVIDIA driver versions and describe how your data set was preprocessed.
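
Something like the following is enough to collect those versions (run on the host, outside the container):

uname -r                                                     # kernel version
nvidia-smi --query-gpu=driver_version --format=csv,noheader  # NVIDIA driver version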

Hi nluehr,
I have updated my AlexNet results below. Are they similar to your results without the pre-resizing?
Do you think the experiment settings on that benchmark page are not clearly described?
My Linux kernel is 3.10.0-862.14.4.el7.x86_64 and the driver is 410.48.
Can this combination of kernel and driver reach the expected performance?

root@c7488d10611b:/workspace/nvidia-examples/cnn# mpiexec --allow-run-as-root --bind-to socket -np 2 -x CUDA_VISIBLE_DEVICES=4,5 python alexnet.py --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012 -u batch -i 10000 -b 512 --display_every 100 --log_dir /data/learning/ --export_dir /data/learning/

  Step Epoch Img/sec   Loss  LR
     1   0.0   320.1  6.921  9.995 1.00000
   100   0.1  2456.1  6.846  8.256 0.99824
   200   0.2  2599.5  6.784  7.607 0.99647
   300   0.2  2595.4  6.749  7.326 0.99470
   400   0.3  2570.5  6.688  7.139 0.99292
   500   0.4  2577.7  6.555  6.934 0.99116
   600   0.5  2512.9  6.562  6.902 0.98939

....
root@e480de39b832:/workspace/nvidia-examples/cnn# mpiexec --allow-run-as-root --bind-to socket -np 4 -x CUDA_VISIBLE_DEVICES=4,5,6,7 python alexnet.py --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012 -u batch -i 10000 -b 256 --display_every 100 --log_dir /data/learning/ --export_dir /data/learning/

  Step Epoch Img/sec   Loss  LR
     1   0.0   356.9  6.915  9.989 1.00000
   100   0.1  4020.3  6.878  8.322 0.99824
   200   0.2  4165.4  6.760  7.605 0.99647
   300   0.2  4187.7  6.682  7.268 0.99470
   400   0.3  4193.9  6.652  7.097 0.99292
   500   0.4  4308.5  6.649  7.021 0.99116
   600   0.5  4223.3  6.557  6.892 0.98939
   700   0.6  4179.3  6.345  6.658 0.98762
   800   0.6  4293.5  6.271  6.569 0.98586
   900   0.7  4343.1  6.481  6.767 0.98409
root@c7488d10611b:/workspace/nvidia-examples/cnn# mpiexec --allow-run-as-root --bind-to socket -np 4 -x CUDA_VISIBLE_DEVICES=4,5,6,7 python alexnet.py --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012 -u batch -i 10000 -b 512 --display_every 100 --log_dir /data/learning/ --export_dir /data/learning/

  Step Epoch Img/sec   Loss  LR
     1   0.0   608.8  6.908  9.982 1.00000
   100   0.2  4366.8  6.859  8.234 0.99649
   200   0.3  4435.6  6.669  7.520 0.99294
   300   0.5  4610.2  6.566  7.203 0.98941
   400   0.6  4561.3  6.308  6.844 0.98587
   500   0.8  4613.2  6.242  6.720 0.98235
   600   1.0  4526.1  6.103  6.542 0.97883
   700   1.1  4563.5  5.965  6.377 0.97532
   800   1.3  4511.3  5.820  6.211 0.97182

With full-sized images I see the following.

2xP100 batch_size=512 ~2990 img/sec
4xP100 batch_size=256 ~6000 img/sec
4xP100 batch_size=512 ~6500 img/sec

At least for CentOS, kernels after 3.10.0-693.11.6.el7.x86_64 include Spectre/Meltdown mitigations, and these likely explain the lower performance you are seeing. With 4 GPUs the CPU bottleneck is most severe, so the CPU performance loss from the kernel patches would be most noticeable there.
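
One quick way to check is to look at the vulnerabilities entries that patched kernels expose in sysfs (this path should exist on updated RHEL/CentOS 7 kernels; if it is missing, the kernel most likely predates the mitigations):

# list mitigation status per vulnerability, one line per file
grep . /sys/devices/system/cpu/vulnerabilities/*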

Given that you are CPU-limited, you could consider using DALI to move much of the input pipeline to the GPU. You will need to create index files for your TFRecord inputs (see https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/dataloading_tfrecord.html), and then add --use_dali --data_idx_dir=/path/to/dali_index_files to your commands above (a rough sketch follows at the end of this post). Here is what I see with DALI enabled.

4xP100 batch_size=256 +dali ~7300 img/sec
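
In case it helps, here is a rough sketch of the index-building step using the tfrecord2idx script that ships with DALI. I'm assuming the usual train-*/validation-* shard names and the data_dir path from your commands; adjust the paths and index naming to your layout.

# Build one DALI index file per TFRecord shard.
DATA=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012
IDX=/data/learning/dali_index
mkdir -p "$IDX"
for f in "$DATA"/train-* "$DATA"/validation-*; do
  tfrecord2idx "$f" "$IDX/$(basename "$f").idx"
done

# Then re-run with DALI enabled, e.g.
mpiexec --allow-run-as-root --bind-to socket -np 4 -x CUDA_VISIBLE_DEVICES=4,5,6,7 python alexnet.py --data_dir="$DATA" -u batch -i 10000 -b 256 --display_every 100 --use_dali --data_idx_dir="$IDX"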