We use CUDA 9.0, cuDNN 7.1.4, NCCL 2.2.12 and CUDA driver 384.125 on our Ubuntu 16.04 host, and the TensorFlow container 18.05-py3 from NGC.
The TensorFlow version is 1.8.0 on the host, and the tf_cnn_benchmarks version is master both on the host and in the container.
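For reference, we launch the container roughly like this (the mount path is illustrative; our exact options may differ):

    nvidia-docker run -it --rm \
        -v /data/imagenet:/data/imagenet \
        nvcr.io/nvidia/tensorflow:18.05-py3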
On a single V100, the container needs more than 7000 steps to reach its peak images/sec, while the host reaches the same images/sec after only about 400 steps. So warm-up in the container is much slower than on the host, although the steady-state images/sec is identical in both.
With a P100 there is no such problem.
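A rough sketch of how we invoke the benchmark in both environments (model and batch size here are illustrative; --data_dir points at real ImageNet data, which matches the JPEG-decode work in the profiles below):

    python tf_cnn_benchmarks.py --num_gpus=1 --model=resnet50 --batch_size=64 \
        --data_dir=/data/imagenet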
Here are the perf record results for CPU cycles; each call seems slower in the container than on the host.
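The profiles were collected with something like the following (sampling window and pid lookup are illustrative):

    perf record -e cpu-clock -p $(pgrep -f tf_cnn_benchmarks | head -n1) -- sleep 60
    perf report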
In container:
Samples: 793K of event 'cpu-clock', Event count (approx.): 198255000000
Overhead Command Shared Object Symbol
14.18% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] decode_mcu
7.60% tf_cnn_benchmar _distort_image_ops.so [.] std::_Function_handler<void (long long, long long), tensorflow::AdjustHsvInYiqOp<Eigen::ThreadPoolDevice>::DoCompute(tensorf
5.95% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::(anonymous namespace)::resize_image<unsigned char>
4.77% tf_cnn_benchmar [unknown] [k] 0xffffffff8184ef5a
3.94% tf_cnn_benchmar libcuda.so.384.125 [.] 0x00000000002c2726
3.66% tf_cnn_benchmar [vdso] [.] __vdso_clock_gettime
3.02% tf_cnn_benchmar libtensorflow_framework.so [.] Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop
2.81% tf_cnn_benchmar [unknown] [k] 0xffffffff8184e9b5
1.91% tf_cnn_benchmar [unknown] [k] 0xffffffff8140c717
1.78% tf_cnn_benchmar libcuda.so.384.125 [.] 0x0000000000286922
1.64% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::AdjustContrastOpv2<Eigen::ThreadPoolDevice>::DoCompute
1.43% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::ClipOp<Eigen::ThreadPoolDevice, float>::Compute
1.14% tf_cnn_benchmar libc-2.23.so [.] 0x000000000014e156
0.98% tf_cnn_benchmar libcuda.so.384.125 [.] 0x00000000001f3436
0.92% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_ycc_rgb_convert_sse2.columnloop
0.91% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] std::_Function_handler<void (long, long), Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorChippingOp<0l,
0.84% tf_cnn_benchmar libcuda.so.384.125 [.] 0x00000000001f3448
0.80% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_idct_ifast_sse2.column_end
0.73% tf_cnn_benchmar [unknown] [k] 0xffffffff8106c6bb
0.67% tf_cnn_benchmar [unknown] [k] 0xffffffff81055e82
0.64% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] std::_Function_handler<void (long long, long long), void tensorflow::(anonymous namespace)::ReverseRows<unsigned char, 3>(te
0.62% tf_cnn_benchmar libc-2.23.so [.] malloc
0.59% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] decompress_onepass
0.58% tf_cnn_benchmar libcuda.so.384.125 [.] 0x00000000002c2724
0.57% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_idct_ifast_sse2.columnDCT
For a higher level overview, try: perf report --sort comm,dso
On host:
Samples: 639K of event 'cpu-clock', Event count (approx.): 159802000000
Overhead Command Shared Object Symbol
18.41% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] decode_mcu
14.38% tf_cnn_benchmar _distort_image_ops.so [.] std::_Function_handler<void (long long, long long), tensorflow::AdjustHsvInYiqOp<Eigen::ThreadPoolDevice>::DoCompute(tensorflow::
7.61% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::(anonymous namespace)::resize_image<unsigned char>
3.63% tf_cnn_benchmar libtensorflow_framework.so [.] Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop
3.26% tf_cnn_benchmar [kernel.kallsyms] [k] __lock_text_start
2.58% tf_cnn_benchmar libc-2.23.so [.] __memcpy_avx_unaligned
2.14% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::AdjustContrastOpv2<Eigen::ThreadPoolDevice>::DoCompute
1.50% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] tensorflow::ClipOp<Eigen::ThreadPoolDevice, float>::Compute
1.47% tf_cnn_benchmar [kernel.kallsyms] [k] entry_SYSCALL_64_after_swapgs
1.36% python [kernel.kallsyms] [k] __schedule
1.25% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_ycc_rgb_convert_sse2.columnloop
1.06% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] std::_Function_handler<void (long, long), Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorChippingOp<0l, Eigen
1.05% tf_cnn_benchmar [kernel.kallsyms] [k] __do_page_fault
1.05% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_idct_ifast_sse2.column_end
1.04% tf_cnn_benchmar libc-2.23.so [.] __memset_avx2
1.01% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] Eigen::TensorEvaluator<Eigen::TensorSlicingOp<Eigen::array<long, 1ul> const, Eigen::array<long, 1ul> const, Eigen::TensorSlicingO
0.91% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] Eigen::TensorEvaluator<Eigen::TensorSlicingOp<Eigen::array<long, 1ul> const, Eigen::array<long, 1ul> const, Eigen::TensorSlicingO
0.87% tf_cnn_benchmar [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys
0.87% tf_cnn_benchmar [kernel.kallsyms] [k] clear_page_c_e
0.86% tf_cnn_benchmar libc-2.23.so [.] _int_free
0.81% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] std::_Function_handler<void (long long, long long), void tensorflow::(anonymous namespace)::ReverseRows<unsigned char, 3>(tensorf
0.78% python [kernel.kallsyms] [k] entry_SYSCALL_64_after_swapgs
0.76% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] decompress_onepass
0.74% tf_cnn_benchmar _pywrap_tensorflow_internal.so [.] jsimd_idct_ifast_sse2.columnDCT
0.72% tf_cnn_benchmar libc-2.23.so [.] malloc
For a higher level overview, try: perf report --sort comm,dso
Does anyone know the reason for this difference? Or is there some setting I should change in the NGC container to get better performance?