Difference in performance for parallel decode/encode with ffmpeg h264_cuvid and h264_nvenc on Tesla P100

Hi,

I have made performance measurements for decoding and encoding h264 video with up to 8 parallel threads on a Tesla P100.
The test runs on Ubuntu 17.04 with NVIDIA driver 384.90, using ffmpeg with h264_cuvid for decoding and h264_nvenc for encoding (average time per frame in milliseconds):

By number of threads:

  • 1 : decode 0.5 encode 0.4
  • 2 : decode 0.9 encode 0.4
  • 3 : decode 1.5 encode 0.4
  • 4 : decode 2.0 encode 0.4
  • 5 : decode 2.6 encode 0.4
  • 6 : decode 3.2 encode 0.5
  • 7 : decode 3.7 encode 0.6
  • 8 : decode 4.3 encode 0.6

So the decoding time increases linearly with the number of threads, while the encoding time stays more or less stable.
With 9 threads, I get the error ‘No NVENC capable devices found’.
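A quick sanity check on the decode numbers above: the per-thread slowdown corresponds to a roughly constant aggregate throughput, which this small calculation shows:

```python
# Per-thread average decode time in ms, taken from the list above.
decode_ms = {1: 0.5, 2: 0.9, 3: 1.5, 4: 2.0, 5: 2.6, 6: 3.2, 7: 3.7, 8: 4.3}

for threads, ms in decode_ms.items():
    # Aggregate throughput across all threads, in frames per second.
    fps = threads / (ms / 1000.0)
    print(f"{threads} thread(s): {fps:.0f} fps aggregate")
```

The aggregate stays in the roughly 1860–2220 fps range regardless of thread count, which would be consistent with all threads sharing a single decode engine.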

Output of nvidia-smi during encoding with 8 threads:

Timestamp : Tue Nov 14 10:30:25 2017
Driver Version : 384.90

Attached GPUs : 1
GPU 00000000:00:08.0
    FB Memory Usage
        Total : 16276 MiB
        Used : 5307 MiB
        Free : 10969 MiB
    BAR1 Memory Usage
        Total : 16384 MiB
        Used : 2 MiB
        Free : 16382 MiB
    Compute Mode : Default
    Utilization
        Gpu : 47 %
        Memory : 26 %
        Encoder : 51 %
        Decoder : 100 %
    GPU Utilization Samples
        Duration : 18446744073709.22 sec
        Number of Samples : 99
        Max : 51 %
        Min : 0 %
        Avg : 0 %
    Memory Utilization Samples
        Duration : 18446744073709.22 sec
        Number of Samples : 99
        Max : 0 %
        Min : 0 %
        Avg : 0 %
    ENC Utilization Samples
        Duration : 18446744073709.22 sec
        Number of Samples : 99
        Max : 51 %
        Min : 0 %
        Avg : 0 %
    DEC Utilization Samples
        Duration : 18446744073709.22 sec
        Number of Samples : 99
        Max : 99 %
        Min : 0 %
        Avg : 0 %
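For monitoring longer runs, the percentages in the Utilization section can be scraped from `nvidia-smi -q`-style output with a small parser like this. It is only a sketch and assumes the "Name : NN %" field layout shown above:

```python
import re

def parse_utilization(text):
    """Extract the Utilization section percentages from `nvidia-smi -q` output."""
    util = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Utilization":
            in_section = True
            continue
        if in_section:
            m = re.match(r"(\w+)\s*:\s*(\d+)\s*%", stripped)
            if m:
                util[m.group(1)] = int(m.group(2))
            else:
                break  # first non-matching line ends the section
    return util
```

Feeding it the dump above would return the Gpu/Memory/Encoder/Decoder percentages as a dict.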

Question: Is this decoding performance normal behaviour on a Tesla P100?

Thanks and regards, Haye