We tried to reproduce the DeepStream SDK 5.0 performance results from the samples by running up to 30 1080p streams using the config file source30_1080p_dec_infer-resnet_tiled_display_int8.txt across multiple GPUs, as detailed below:
• Hardware Platform (dGPU): Tesla V100 and Tesla T4
• DeepStream Version 5.0
• TensorRT Version 7.0
• NVIDIA GPU Driver Version 450.51
Using the command: $ deepstream-app -c /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source30_1080p_dec_infer-resnet_tiled_display_int8.txt
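For reference, the FPS numbers below are the per-stream figures printed by deepstream-app's built-in performance measurement, which (as far as I can tell) is already enabled in the sample config via the [application] group:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5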
Tesla T4 Results:
Using a single instance of the command mentioned above, the 30 streams were running at ~30 FPS while GPU utilization was ~35%.
Using two instances of the command, the 60 streams were running at ~18.5 FPS while GPU utilization stayed at ~35%. Why did the per-stream FPS drop although the card did not max out?
Tesla V100 Results:
Using a single instance of the command mentioned above, the 30 streams were running at ~20 FPS while GPU utilization was ~25%.
Using two instances of the command, the 60 streams were running at ~10 FPS while GPU utilization was ~25%.
I have the following questions:
Why was the Tesla V100 outperformed by the Tesla T4, although the V100 has double the Tensor Cores of the T4?
Why did the Tesla T4 cap its performance at only ~35% utilization? Where is the bottleneck? Does it have anything to do with the model running in INT8 mode, or with compute capability?
Did you make any changes to /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source30_1080p_dec_infer-resnet_tiled_display_int8.txt?
For a GPU/decoding perf test, please change the output to fakesink so that the FPS is not limited by the display, which normally refreshes at only 30 or 60 Hz.
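For example, in the [sink0] group of the config, something like the following (values other than type and sync are just the sample defaults):

[sink0]
enable=1
# Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
gpu-id=0
nvbuf-memory-type=0

type=1 selects fakesink, and sync=0 lets the pipeline run as fast as possible instead of pacing output to the clock.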
Our tests were performed on AWS machines. We used the same hard drive in both tests and the same source30_1080p_dec_infer-resnet_tiled_display_int8.txt configuration with fakesink, and there is no attached display.
One thing to note, though: we ran the bandwidthTest CUDA sample on both machines, and HtoD and DtoH bandwidth maxed out at only 6.6 GB/s, which confirms that the PCIe bus was running at x8. However, that does not explain how or why the V100 was outperformed by the T4.
I will do the complete checkup mentioned. Here is one thing I already know for sure: the PCIe buses on AWS were running at PCIe 3.0 x8 with a maximum bandwidth of 6.6 GB/s (measured via the CUDA bandwidthTest sample) on both machines (the Tesla T4 and the V100 machine). However, the V100 was still outperformed despite running under the same conditions as the T4.
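For anyone who wants to reproduce the bandwidth number without building the full CUDA samples, here is a rough sketch of the same HtoD measurement (pinned host memory, timed with CUDA events). It is not the bandwidthTest sample itself, just an approximation of what it does:

// Minimal host-to-device bandwidth check (approximation of bandwidthTest).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;   // 256 MiB per transfer
    const int    iters = 20;

    void *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);    // pinned host memory, as bandwidthTest uses
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms / 1e3) / 1e9;
    printf("HtoD bandwidth: %.2f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

Around 6.6 GB/s is what a PCIe 3.0 x8 link delivers (a x16 link would be closer to 12-13 GB/s), but since the link width is the same on both instances it should not favor the T4 over the V100.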