[Linux] NVCUVID - Performance

Hi, I'm building an application that uses the NVCUVID API to decode 1920x1080 interlaced H.264 video streams.

I've successfully managed to demux a transport stream using ffmpeg, send the data to the VideoParser and then to the VideoDecoder. I correctly get the raw NV12 frames, and everything seems to work.
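
(For context, the wiring follows the stock NVCUVID pattern - a simplified sketch below; the callback bodies are elided and the helper names are mine:)

#include <string.h>
#include <stdint.h>
#include <cuda.h>
#include <nvcuvid.h>

/* Parser callbacks (bodies elided): the sequence callback creates the decoder,
   the decode callback calls cuvidDecodePicture(), and the display callback
   maps/consumes the decoded NV12 surface. */
static int CUDAAPI HandleVideoSequence(void *user, CUVIDEOFORMAT *fmt);
static int CUDAAPI HandlePictureDecode(void *user, CUVIDPICPARAMS *pic);
static int CUDAAPI HandlePictureDisplay(void *user, CUVIDPARSERDISPINFO *disp);

static CUvideoparser create_h264_parser(void *user_data)
{
    CUVIDPARSERPARAMS pp;
    memset(&pp, 0, sizeof(pp));
    pp.CodecType              = cudaVideoCodec_H264;
    pp.ulMaxNumDecodeSurfaces = 20;   /* must cover the H.264 DPB */
    pp.ulMaxDisplayDelay      = 1;
    pp.pUserData              = user_data;
    pp.pfnSequenceCallback    = HandleVideoSequence;
    pp.pfnDecodePicture       = HandlePictureDecode;
    pp.pfnDisplayPicture      = HandlePictureDisplay;

    CUvideoparser parser = NULL;
    cuvidCreateVideoParser(&parser, &pp);
    return parser;
}

/* Every demuxed packet coming out of ffmpeg is forwarded like this. */
static void feed_packet(CUvideoparser parser,
                        const unsigned char *data, int size, int64_t pts)
{
    CUVIDSOURCEDATAPACKET pkt;
    memset(&pkt, 0, sizeof(pkt));
    pkt.payload      = data;
    pkt.payload_size = size;
    pkt.flags        = CUVID_PKT_TIMESTAMP;
    pkt.timestamp    = pts;
    cuvidParseVideoData(parser, &pkt);
}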

Using the nvidia-settings application I can measure GPU usage (GPU Utilization and Video Engine Utilization).
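
(In case it helps anyone reproducing this: if I'm not mistaken, recent drivers expose the same counters on the terminal; the following prints per-second utilization columns, where sm corresponds to GPU Utilization and dec to Video Engine Utilization.)

$ nvidia-smi dmon -s u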

With a GeForce GT 740, I'm only able to decode 4 interlaced HD H.264 streams, with a GPU Utilization of 10% and a Video Engine Utilization of 90%. I've tried changing CUVID settings such as CUVIDDECODECREATEINFO::ulCreationFlags, CUVIDDECODECREATEINFO::ulNumDecodeSurfaces, CUVIDDECODECREATEINFO::ulNumOutputSurfaces and CUVIDPARSERPARAMS::ulMaxDisplayDelay, but everything gives me similar results.
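
(For reference, this is roughly how I create the decoder inside the sequence callback - a simplified sketch; create_decoder is just an illustrative name, and the two surface counts are the knobs I've been varying:)

static CUvideodecoder create_decoder(const CUVIDEOFORMAT *fmt)
{
    CUVIDDECODECREATEINFO ci;
    memset(&ci, 0, sizeof(ci));
    ci.CodecType           = fmt->codec;             /* cudaVideoCodec_H264 */
    ci.ChromaFormat        = fmt->chroma_format;     /* 4:2:0 */
    ci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
    ci.ulWidth             = fmt->coded_width;       /* 1920 */
    ci.ulHeight            = fmt->coded_height;      /* 1088 */
    ci.ulTargetWidth       = fmt->coded_width;
    ci.ulTargetHeight      = fmt->coded_height;
    ci.DeinterlaceMode     = cudaVideoDeinterlaceMode_Adaptive; /* interlaced input */
    ci.ulNumDecodeSurfaces = 20;                     /* knob #1 */
    ci.ulNumOutputSurfaces = 2;                      /* knob #2 */
    ci.ulCreationFlags     = cudaVideoCreate_PreferCUVID;

    CUvideodecoder dec = NULL;
    cuvidCreateDecoder(&dec, &ci);
    return dec;
}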

In addition, I've benchmarked the GPU using the 3_Imaging/cudaDecodeGL sample. Testing with the flags -nointerop -decodecuda and -nointerop -decodecuvid gives similar results: an average of 270 fps.

I've bought another card, a GeForce GTX 780 Ti, hoping for better results, but I get the same performance as with the GT 740.

What is the difference between GPU Utilization and Video Engine Utilization? Why am I getting high Video Engine Utilization and practically no GPU Utilization?

Is there something I'm missing in order to improve NVCUVID's performance?

How can I tell which card is best for decoding purposes? There is a huge price difference between the two cards I've used (GeForce GT 740 vs GeForce GTX 780 Ti), yet I'm getting the same performance in the decoding process.

Thanks in advance!

“What is the difference between GPU Utilization and Video Engine Utilization? Why am I getting high Video Engine Utilization and practically no GPU Utilization?”

Because, seemingly, most of your work is video-related and in 'video format':
interlaced h264
ffmpeg
raw nv12 frame
VideoParser
VideoDecoder

The 'GPGPU' part of the device would be grossly disinterested.

I am no video lord, but you seem to have numerous video streams.
Hence, are you sure one device is sufficient for the load, or should you perhaps consider multiple devices?

You are right, my idea is to decode several video streams, and I'm considering using several video cards. But my problem is deciding which NVIDIA card is the best (in terms of the performance/price ratio) for this.

Using NVCUVID (decoding H.264), I've achieved the same performance with a GeForce GT 740 and a GeForce GTX 780 Ti, cards with a huge price difference. How can I know which NVIDIA card is the best for decoding (and encoding) video? Should I use a card from the Quadro series? Which one?
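
(To be clear, binding sessions to boards isn't the blocking issue - a decode session simply belongs to whichever CUDA context is current when it's created. A minimal sketch of the multi-card plan, with device_index as a placeholder:)

/* Sketch: one context per board; any parser/decoder created while this
   context is current runs on that board. */
static CUcontext context_for_board(int device_index)
{
    CUdevice  dev;
    CUcontext ctx;
    cuInit(0);                        /* safe to call more than once */
    cuDeviceGet(&dev, device_index);  /* e.g. 0 = GTX 780 Ti, 1 = GT 740 here */
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);
    return ctx;                       /* becomes current on the calling thread */
}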

Thanks!

Edit:
In addition, should I expect a performance difference when using different CUVIDDECODECREATEINFO::ulCreationFlags values? On CentOS 7 with the latest driver I get the same performance using cudaVideoCreate_Default, cudaVideoCreate_PreferCUDA and cudaVideoCreate_PreferCUVID. Is this option Windows-specific?
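
(For completeness, these are the exact identifiers from cuviddec.h that I cycled through; the comments reflect what the cudaDecodeGL help text says about the corresponding command-line options:)

/* The three values I compared for CUVIDDECODECREATEINFO::ulCreationFlags;
   the decoder is recreated with each one, everything else kept identical. */
static const cudaVideoCreateFlags variants[] = {
    cudaVideoCreate_Default,      /* let the driver pick */
    cudaVideoCreate_PreferCUDA,   /* "soft" decode on the CUDA cores (the sample
                                     only offers this path for MPEG-2) */
    cudaVideoCreate_PreferCUVID,  /* "hard" decode on the dedicated video engine */
};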

Again, I am no video lord; but the idea I get is that you process predominantly in the video-format domain, not entirely in the pure digital/logic domain - you seem to predominantly process data (already) in video-related formats, and not in 'loose', digital format.

I am not altogether certain how the video-specific performance or muscle of devices is measured, but when you look at the general processing muscle of the mentioned devices, you quickly note that the 780 has plenty, including double precision.
That may constitute buying what you do not need - you are perhaps buying far more general processing power than the application really requires.
Looking at the Quadro line is not a bad idea - perhaps the data sheets can tell you in more detail how the video processing power is rated, such that you can start to compare cost per unit of muscle across devices.
Once you spot the method of measure, you can easily cross-compare it across devices.

With regards to the different APIs - I honestly do not know.
The documentation on these APIs should provide you with some background information that might help answer the question.
Apart from that, experimentation seems the best approach, should you fail to obtain a precise answer.

I'm measuring performance using the sample 3_Imaging/cudaDecodeGL.
The results are these:

The video used in the test is a Full HD H.264 MP4:

$ ffmpeg -i /mnt/A/TS/video.mp4 
ffmpeg version 1.0.6_patch-aac-resample-lock Copyright (c) 2000-2013 the FFmpeg developers
  built on Jun  2 2015 17:00:42 with gcc 4.8.3 (GCC) 20140911 (Red Hat 4.8.3-9)
  configuration: --enable-libvpx --enable-shared --prefix=/usr --enable-libtheora --enable-postproc --enable-gpl --enable-libmp3lame --enable-libvorbis --enable-libx264 --enable-libfdk_aac --enable-nonfree --libdir=/usr/lib64 --shlibdir=/usr/lib64
  libavutil      51. 73.101 / 51. 73.101
  libavcodec     54. 59.100 / 54. 59.100
  libavformat    54. 29.104 / 54. 29.104
  libavdevice    54.  2.101 / 54.  2.101
  libavfilter     3. 17.100 /  3. 17.100
  libswscale      2.  1.101 /  2.  1.101
  libswresample   0. 15.100 /  0. 15.100
  libpostproc    52.  0.100 / 52.  0.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x72c240] multiple edit list entries, a/v desync might occur, patch welcome
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/mnt/A/TS/video.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf54.29.104
  Duration: 00:00:28.66, start: 1.533000, bitrate: 13610 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 13607 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 59.94 tbc
    Metadata:
      handler_name    : VideoHandler

GeForce GTX 780 Ti
Using preferCUDA:

$ ./cudaDecodeGL -nointerop -decodecuda -device=0 /mnt/A/TS/video.mp4
[CUDA/OpenGL Video Decode]
Command Line Arguments:
argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuda
argv[3] = -device=0
argv[4] = /mnt/A/TS/video.mp4
[cudaDecodeGL]: input file: </mnt/A/TS/video.mp4>
	VideoCodec      : AVC/H.264
	Frame rate      : 30000/1001fps ~ 29.97fps
	Sequence format : Interlaced
	Coded frame size: [1920, 1088]
	Display area    : [0, 0, 1920, 1080]
	Chroma format   : 4:2:0
	Bitrate         : unknown
	Aspect ratio    : 16:9


argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuda
argv[3] = -device=0
argv[4] = /mnt/A/TS/video.mp4

gpuDeviceInitDRV() Using CUDA Device [0]: GeForce GTX 780 Ti
gpuDeviceInitDRV() Using CUDA Device [0]: GeForce GTX 780 Ti
> Using GPU Device: GeForce GTX 780 Ti has SM 3.5 compute capability
  Total amount of global memory:     3071.3125 MB
>> modInitCTX<NV12ToARGB_drvapi64.ptx > initialized OK
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x01dd64b0) = <   NV12ToARGB_drvapi >
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x01ddd540) = <     Passthru_drvapi >
  Free memory:     2847.5508 MB
> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

[cudaDecodeGL] - [Field: 0016, 00.0 fps, frame time: 90032283648.00 (ms) ]
[cudaDecodeGL] - [Field: 0032, 136.3 fps, frame time: 7.34 (ms) ]
[cudaDecodeGL] - [Field: 0048, 136.3 fps, frame time: 7.33 (ms) ]
[cudaDecodeGL] - [Field: 0064, 137.0 fps, frame time: 7.30 (ms) ]
[cudaDecodeGL] - [Field: 0080, 138.2 fps, frame time: 7.24 (ms) ]
[cudaDecodeGL] - [Field: 0096, 134.5 fps, frame time: 7.44 (ms) ]
[cudaDecodeGL] - [Field: 0112, 133.7 fps, frame time: 7.48 (ms) ]
[cudaDecodeGL] - [Field: 0128, 132.5 fps, frame time: 7.55 (ms) ]
[cudaDecodeGL] - [Field: 0144, 134.9 fps, frame time: 7.41 (ms) ]
[cudaDecodeGL] - [Field: 0160, 133.0 fps, frame time: 7.52 (ms) ]
[cudaDecodeGL] - [Field: 0176, 137.1 fps, frame time: 7.29 (ms) ]
[cudaDecodeGL] - [Field: 0192, 141.6 fps, frame time: 7.06 (ms) ]
[cudaDecodeGL] - [Field: 0208, 137.7 fps, frame time: 7.26 (ms) ]
[cudaDecodeGL] - [Field: 0224, 139.0 fps, frame time: 7.19 (ms) ]
[cudaDecodeGL] - [Field: 0240, 133.9 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0256, 136.6 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0272, 138.7 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0288, 135.7 fps, frame time: 7.37 (ms) ]
[cudaDecodeGL] - [Field: 0304, 136.8 fps, frame time: 7.31 (ms) ]
[cudaDecodeGL] - [Field: 0320, 140.1 fps, frame time: 7.14 (ms) ]
[cudaDecodeGL] - [Field: 0336, 157.4 fps, frame time: 6.35 (ms) ]
[cudaDecodeGL] - [Field: 0352, 133.9 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0368, 133.3 fps, frame time: 7.50 (ms) ]
[cudaDecodeGL] - [Field: 0384, 134.0 fps, frame time: 7.46 (ms) ]
[cudaDecodeGL] - [Field: 0400, 133.7 fps, frame time: 7.48 (ms) ]
[cudaDecodeGL] - [Field: 0416, 134.1 fps, frame time: 7.46 (ms) ]
[cudaDecodeGL] - [Field: 0432, 133.3 fps, frame time: 7.50 (ms) ]
[cudaDecodeGL] - [Field: 0448, 136.5 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0464, 134.8 fps, frame time: 7.42 (ms) ]
[cudaDecodeGL] - [Field: 0480, 139.5 fps, frame time: 7.17 (ms) ]
[cudaDecodeGL] - [Field: 0496, 141.4 fps, frame time: 7.07 (ms) ]
[cudaDecodeGL] - [Field: 0512, 139.8 fps, frame time: 7.15 (ms) ]
[cudaDecodeGL] - [Field: 0528, 139.2 fps, frame time: 7.18 (ms) ]
[cudaDecodeGL] - [Field: 0544, 138.3 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0560, 131.3 fps, frame time: 7.62 (ms) ]
[cudaDecodeGL] - [Field: 0576, 146.5 fps, frame time: 6.83 (ms) ]
[cudaDecodeGL] - [Field: 0592, 144.2 fps, frame time: 6.94 (ms) ]
[cudaDecodeGL] - [Field: 0608, 138.4 fps, frame time: 7.22 (ms) ]
[cudaDecodeGL] - [Field: 0624, 155.3 fps, frame time: 6.44 (ms) ]
[cudaDecodeGL] - [Field: 0640, 145.9 fps, frame time: 6.85 (ms) ]
[cudaDecodeGL] - [Field: 0656, 146.6 fps, frame time: 6.82 (ms) ]
[cudaDecodeGL] - [Field: 0672, 143.7 fps, frame time: 6.96 (ms) ]
[cudaDecodeGL] - [Field: 0688, 141.3 fps, frame time: 7.07 (ms) ]
[cudaDecodeGL] - [Field: 0704, 141.2 fps, frame time: 7.08 (ms) ]
[cudaDecodeGL] - [Field: 0720, 144.1 fps, frame time: 6.94 (ms) ]
[cudaDecodeGL] - [Field: 0736, 145.9 fps, frame time: 6.85 (ms) ]
[cudaDecodeGL] - [Field: 0752, 145.1 fps, frame time: 6.89 (ms) ]
[cudaDecodeGL] - [Field: 0768, 146.2 fps, frame time: 6.84 (ms) ]
[cudaDecodeGL] - [Field: 0784, 147.2 fps, frame time: 6.79 (ms) ]
[cudaDecodeGL] - [Field: 0800, 140.1 fps, frame time: 7.14 (ms) ]
[cudaDecodeGL] - [Field: 0816, 141.9 fps, frame time: 7.05 (ms) ]
[cudaDecodeGL] - [Field: 0832, 125.2 fps, frame time: 7.99 (ms) ]
[cudaDecodeGL] - [Field: 0848, 130.5 fps, frame time: 7.66 (ms) ]

[cudaDecodeGL] statistics
	 Video Length (hh:mm:ss.msec)   = 00:00:06.173
	 Frames Presented (inc repeats) = 856
	 Average Present Rate     (fps) = 138.65
	 Frames Decoded   (hardware)    = 1712
	 Average Rate of Decoding (fps) = 277.30

Using preferCUVID:

$ ./cudaDecodeGL -nointerop -decodecuvid -device=0 /mnt/A/TS/video.mp4
[CUDA/OpenGL Video Decode]
Command Line Arguments:
argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuvid
argv[3] = -device=0
argv[4] = /mnt/A/TS/video.mp4
[cudaDecodeGL]: input file: </mnt/A/TS/video.mp4>
	VideoCodec      : AVC/H.264
	Frame rate      : 30000/1001fps ~ 29.97fps
	Sequence format : Interlaced
	Coded frame size: [1920, 1088]
	Display area    : [0, 0, 1920, 1080]
	Chroma format   : 4:2:0
	Bitrate         : unknown
	Aspect ratio    : 16:9


argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuvid
argv[3] = -device=0
argv[4] = /mnt/A/TS/video.mp4

gpuDeviceInitDRV() Using CUDA Device [0]: GeForce GTX 780 Ti
gpuDeviceInitDRV() Using CUDA Device [0]: GeForce GTX 780 Ti
> Using GPU Device: GeForce GTX 780 Ti has SM 3.5 compute capability
  Total amount of global memory:     3071.3125 MB
>> modInitCTX<NV12ToARGB_drvapi64.ptx > initialized OK
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x0140c150) = <   NV12ToARGB_drvapi >
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x0140f2e0) = <     Passthru_drvapi >
  Free memory:     2847.0508 MB
> VideoDecoder::cudaVideoCreateFlags = <4>Use CUVID decoder

[cudaDecodeGL] - [Field: 0016, 00.0 fps, frame time: 90032283648.00 (ms) ]
[cudaDecodeGL] - [Field: 0032, 136.7 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0048, 136.7 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0064, 136.6 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0080, 137.9 fps, frame time: 7.25 (ms) ]
[cudaDecodeGL] - [Field: 0096, 135.3 fps, frame time: 7.39 (ms) ]
[cudaDecodeGL] - [Field: 0112, 133.9 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0128, 132.3 fps, frame time: 7.56 (ms) ]
[cudaDecodeGL] - [Field: 0144, 135.2 fps, frame time: 7.40 (ms) ]
[cudaDecodeGL] - [Field: 0160, 133.2 fps, frame time: 7.51 (ms) ]
[cudaDecodeGL] - [Field: 0176, 136.4 fps, frame time: 7.33 (ms) ]
[cudaDecodeGL] - [Field: 0192, 141.8 fps, frame time: 7.05 (ms) ]
[cudaDecodeGL] - [Field: 0208, 137.9 fps, frame time: 7.25 (ms) ]
[cudaDecodeGL] - [Field: 0224, 139.1 fps, frame time: 7.19 (ms) ]
[cudaDecodeGL] - [Field: 0240, 134.1 fps, frame time: 7.46 (ms) ]
[cudaDecodeGL] - [Field: 0256, 135.9 fps, frame time: 7.36 (ms) ]
[cudaDecodeGL] - [Field: 0272, 138.1 fps, frame time: 7.24 (ms) ]
[cudaDecodeGL] - [Field: 0288, 137.1 fps, frame time: 7.29 (ms) ]
[cudaDecodeGL] - [Field: 0304, 136.8 fps, frame time: 7.31 (ms) ]
[cudaDecodeGL] - [Field: 0320, 138.7 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0336, 159.0 fps, frame time: 6.29 (ms) ]
[cudaDecodeGL] - [Field: 0352, 134.0 fps, frame time: 7.46 (ms) ]
[cudaDecodeGL] - [Field: 0368, 133.2 fps, frame time: 7.51 (ms) ]
[cudaDecodeGL] - [Field: 0384, 134.1 fps, frame time: 7.46 (ms) ]
[cudaDecodeGL] - [Field: 0400, 133.3 fps, frame time: 7.50 (ms) ]
[cudaDecodeGL] - [Field: 0416, 134.2 fps, frame time: 7.45 (ms) ]
[cudaDecodeGL] - [Field: 0432, 134.3 fps, frame time: 7.45 (ms) ]
[cudaDecodeGL] - [Field: 0448, 135.7 fps, frame time: 7.37 (ms) ]
[cudaDecodeGL] - [Field: 0464, 135.9 fps, frame time: 7.36 (ms) ]
[cudaDecodeGL] - [Field: 0480, 138.3 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0496, 142.6 fps, frame time: 7.01 (ms) ]
[cudaDecodeGL] - [Field: 0512, 139.3 fps, frame time: 7.18 (ms) ]
[cudaDecodeGL] - [Field: 0528, 138.4 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0544, 138.3 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0560, 130.7 fps, frame time: 7.65 (ms) ]
[cudaDecodeGL] - [Field: 0576, 146.8 fps, frame time: 6.81 (ms) ]
[cudaDecodeGL] - [Field: 0592, 144.1 fps, frame time: 6.94 (ms) ]
[cudaDecodeGL] - [Field: 0608, 139.4 fps, frame time: 7.17 (ms) ]
[cudaDecodeGL] - [Field: 0624, 155.0 fps, frame time: 6.45 (ms) ]
[cudaDecodeGL] - [Field: 0640, 146.5 fps, frame time: 6.83 (ms) ]
[cudaDecodeGL] - [Field: 0656, 146.6 fps, frame time: 6.82 (ms) ]
[cudaDecodeGL] - [Field: 0672, 143.4 fps, frame time: 6.97 (ms) ]
[cudaDecodeGL] - [Field: 0688, 142.1 fps, frame time: 7.04 (ms) ]
[cudaDecodeGL] - [Field: 0704, 141.4 fps, frame time: 7.07 (ms) ]
[cudaDecodeGL] - [Field: 0720, 144.2 fps, frame time: 6.93 (ms) ]
[cudaDecodeGL] - [Field: 0736, 146.0 fps, frame time: 6.85 (ms) ]
[cudaDecodeGL] - [Field: 0752, 144.7 fps, frame time: 6.91 (ms) ]
[cudaDecodeGL] - [Field: 0768, 146.9 fps, frame time: 6.81 (ms) ]
[cudaDecodeGL] - [Field: 0784, 145.8 fps, frame time: 6.86 (ms) ]
[cudaDecodeGL] - [Field: 0800, 140.7 fps, frame time: 7.11 (ms) ]
[cudaDecodeGL] - [Field: 0816, 141.1 fps, frame time: 7.09 (ms) ]
[cudaDecodeGL] - [Field: 0832, 125.0 fps, frame time: 8.00 (ms) ]
[cudaDecodeGL] - [Field: 0848, 131.0 fps, frame time: 7.63 (ms) ]

[cudaDecodeGL] statistics
	 Video Length (hh:mm:ss.msec)   = 00:00:06.172
	 Frames Presented (inc repeats) = 856
	 Average Present Rate     (fps) = 138.68
	 Frames Decoded   (hardware)    = 1712
	 Average Rate of Decoding (fps) = 277.36

GT 740

Using preferCUDA:

$ ./cudaDecodeGL -nointerop -decodecuda -device=1 /mnt/A/TS/video.mp4
[CUDA/OpenGL Video Decode]
Command Line Arguments:
argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuda
argv[3] = -device=1
argv[4] = /mnt/A/TS/video.mp4
[cudaDecodeGL]: input file: </mnt/A/TS/video.mp4>
	VideoCodec      : AVC/H.264
	Frame rate      : 30000/1001fps ~ 29.97fps
	Sequence format : Interlaced
	Coded frame size: [1920, 1088]
	Display area    : [0, 0, 1920, 1080]
	Chroma format   : 4:2:0
	Bitrate         : unknown
	Aspect ratio    : 16:9


argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuda
argv[3] = -device=1
argv[4] = /mnt/A/TS/video.mp4

gpuDeviceInitDRV() Using CUDA Device [1]: GeForce GT 740
gpuDeviceInitDRV() Using CUDA Device [1]: GeForce GT 740
> Using GPU Device: GeForce GT 740 has SM 3.0 compute capability
  Total amount of global memory:     2047.8125 MB
>> modInitCTX<NV12ToARGB_drvapi64.ptx > initialized OK
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x012027a0) = <   NV12ToARGB_drvapi >
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x01209930) = <     Passthru_drvapi >
  Free memory:     2024.1875 MB
> VideoDecoder::cudaVideoCreateFlags = <1>Use CUDA decoder

[cudaDecodeGL] - [Field: 0016, 00.0 fps, frame time: 90032283648.00 (ms) ]
[cudaDecodeGL] - [Field: 0032, 136.6 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0048, 136.9 fps, frame time: 7.31 (ms) ]
[cudaDecodeGL] - [Field: 0064, 136.8 fps, frame time: 7.31 (ms) ]
[cudaDecodeGL] - [Field: 0080, 138.6 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0096, 135.1 fps, frame time: 7.40 (ms) ]
[cudaDecodeGL] - [Field: 0112, 134.3 fps, frame time: 7.45 (ms) ]
[cudaDecodeGL] - [Field: 0128, 132.8 fps, frame time: 7.53 (ms) ]
[cudaDecodeGL] - [Field: 0144, 134.7 fps, frame time: 7.42 (ms) ]
[cudaDecodeGL] - [Field: 0160, 133.0 fps, frame time: 7.52 (ms) ]
[cudaDecodeGL] - [Field: 0176, 138.3 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0192, 141.0 fps, frame time: 7.09 (ms) ]
[cudaDecodeGL] - [Field: 0208, 138.6 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0224, 138.1 fps, frame time: 7.24 (ms) ]
[cudaDecodeGL] - [Field: 0240, 135.1 fps, frame time: 7.40 (ms) ]
[cudaDecodeGL] - [Field: 0256, 136.4 fps, frame time: 7.33 (ms) ]
[cudaDecodeGL] - [Field: 0272, 139.5 fps, frame time: 7.17 (ms) ]
[cudaDecodeGL] - [Field: 0288, 136.0 fps, frame time: 7.35 (ms) ]
[cudaDecodeGL] - [Field: 0304, 137.1 fps, frame time: 7.29 (ms) ]
[cudaDecodeGL] - [Field: 0320, 139.2 fps, frame time: 7.18 (ms) ]
[cudaDecodeGL] - [Field: 0336, 158.8 fps, frame time: 6.30 (ms) ]
[cudaDecodeGL] - [Field: 0352, 134.8 fps, frame time: 7.42 (ms) ]
[cudaDecodeGL] - [Field: 0368, 133.3 fps, frame time: 7.50 (ms) ]
[cudaDecodeGL] - [Field: 0384, 133.5 fps, frame time: 7.49 (ms) ]
[cudaDecodeGL] - [Field: 0400, 134.2 fps, frame time: 7.45 (ms) ]
[cudaDecodeGL] - [Field: 0416, 134.5 fps, frame time: 7.43 (ms) ]
[cudaDecodeGL] - [Field: 0432, 133.6 fps, frame time: 7.49 (ms) ]
[cudaDecodeGL] - [Field: 0448, 136.6 fps, frame time: 7.32 (ms) ]
[cudaDecodeGL] - [Field: 0464, 135.5 fps, frame time: 7.38 (ms) ]
[cudaDecodeGL] - [Field: 0480, 138.5 fps, frame time: 7.22 (ms) ]
[cudaDecodeGL] - [Field: 0496, 143.3 fps, frame time: 6.98 (ms) ]
[cudaDecodeGL] - [Field: 0512, 139.8 fps, frame time: 7.15 (ms) ]
[cudaDecodeGL] - [Field: 0528, 137.9 fps, frame time: 7.25 (ms) ]
[cudaDecodeGL] - [Field: 0544, 138.7 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0560, 131.5 fps, frame time: 7.60 (ms) ]
[cudaDecodeGL] - [Field: 0576, 146.9 fps, frame time: 6.81 (ms) ]
[cudaDecodeGL] - [Field: 0592, 143.7 fps, frame time: 6.96 (ms) ]
[cudaDecodeGL] - [Field: 0608, 138.4 fps, frame time: 7.23 (ms) ]
[cudaDecodeGL] - [Field: 0624, 156.4 fps, frame time: 6.39 (ms) ]
[cudaDecodeGL] - [Field: 0640, 147.0 fps, frame time: 6.80 (ms) ]
[cudaDecodeGL] - [Field: 0656, 145.6 fps, frame time: 6.87 (ms) ]
[cudaDecodeGL] - [Field: 0672, 144.7 fps, frame time: 6.91 (ms) ]
[cudaDecodeGL] - [Field: 0688, 142.4 fps, frame time: 7.02 (ms) ]
[cudaDecodeGL] - [Field: 0704, 141.1 fps, frame time: 7.09 (ms) ]
[cudaDecodeGL] - [Field: 0720, 144.9 fps, frame time: 6.90 (ms) ]
[cudaDecodeGL] - [Field: 0736, 145.9 fps, frame time: 6.86 (ms) ]
[cudaDecodeGL] - [Field: 0752, 145.2 fps, frame time: 6.88 (ms) ]
[cudaDecodeGL] - [Field: 0768, 147.1 fps, frame time: 6.80 (ms) ]
[cudaDecodeGL] - [Field: 0784, 146.6 fps, frame time: 6.82 (ms) ]
[cudaDecodeGL] - [Field: 0800, 141.7 fps, frame time: 7.06 (ms) ]
[cudaDecodeGL] - [Field: 0816, 140.4 fps, frame time: 7.12 (ms) ]
[cudaDecodeGL] - [Field: 0832, 125.8 fps, frame time: 7.95 (ms) ]
[cudaDecodeGL] - [Field: 0848, 131.1 fps, frame time: 7.63 (ms) ]

[cudaDecodeGL] statistics
	 Video Length (hh:mm:ss.msec)   = 00:00:06.161
	 Frames Presented (inc repeats) = 856
	 Average Present Rate     (fps) = 138.92
	 Frames Decoded   (hardware)    = 1712
	 Average Rate of Decoding (fps) = 277.85

Using preferCUVID:

$ ./cudaDecodeGL -nointerop -decodecuvid -device=1 /mnt/A/TS/video.mp4
[CUDA/OpenGL Video Decode]
Command Line Arguments:
argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuvid
argv[3] = -device=1
argv[4] = /mnt/A/TS/video.mp4
[cudaDecodeGL]: input file: </mnt/A/TS/video.mp4>
	VideoCodec      : AVC/H.264
	Frame rate      : 30000/1001fps ~ 29.97fps
	Sequence format : Interlaced
	Coded frame size: [1920, 1088]
	Display area    : [0, 0, 1920, 1080]
	Chroma format   : 4:2:0
	Bitrate         : unknown
	Aspect ratio    : 16:9


argv[0] = ./cudaDecodeGL
argv[1] = -nointerop
argv[2] = -decodecuvid
argv[3] = -device=1
argv[4] = /mnt/A/TS/video.mp4

gpuDeviceInitDRV() Using CUDA Device [1]: GeForce GT 740
gpuDeviceInitDRV() Using CUDA Device [1]: GeForce GT 740
> Using GPU Device: GeForce GT 740 has SM 3.0 compute capability
  Total amount of global memory:     2047.8125 MB
>> modInitCTX<NV12ToARGB_drvapi64.ptx > initialized OK
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x018e8c40) = <   NV12ToARGB_drvapi >
>> modGetCudaFunction< CUDA file:              NV12ToARGB_drvapi64.ptx >
   CUDA Kernel Function (0x018ebdd0) = <     Passthru_drvapi >
  Free memory:     2024.1875 MB
> VideoDecoder::cudaVideoCreateFlags = <4>Use CUVID decoder

[cudaDecodeGL] - [Field: 0016, 00.0 fps, frame time: 90032308224.00 (ms) ]
[cudaDecodeGL] - [Field: 0032, 136.9 fps, frame time: 7.30 (ms) ]
[cudaDecodeGL] - [Field: 0048, 136.3 fps, frame time: 7.34 (ms) ]
[cudaDecodeGL] - [Field: 0064, 137.7 fps, frame time: 7.26 (ms) ]
[cudaDecodeGL] - [Field: 0080, 137.8 fps, frame time: 7.26 (ms) ]
[cudaDecodeGL] - [Field: 0096, 136.3 fps, frame time: 7.34 (ms) ]
[cudaDecodeGL] - [Field: 0112, 133.1 fps, frame time: 7.52 (ms) ]
[cudaDecodeGL] - [Field: 0128, 132.7 fps, frame time: 7.53 (ms) ]
[cudaDecodeGL] - [Field: 0144, 136.1 fps, frame time: 7.35 (ms) ]
[cudaDecodeGL] - [Field: 0160, 133.2 fps, frame time: 7.51 (ms) ]
[cudaDecodeGL] - [Field: 0176, 137.5 fps, frame time: 7.27 (ms) ]
[cudaDecodeGL] - [Field: 0192, 141.0 fps, frame time: 7.09 (ms) ]
[cudaDecodeGL] - [Field: 0208, 138.7 fps, frame time: 7.21 (ms) ]
[cudaDecodeGL] - [Field: 0224, 138.1 fps, frame time: 7.24 (ms) ]
[cudaDecodeGL] - [Field: 0240, 135.1 fps, frame time: 7.40 (ms) ]
[cudaDecodeGL] - [Field: 0256, 135.3 fps, frame time: 7.39 (ms) ]
[cudaDecodeGL] - [Field: 0272, 139.5 fps, frame time: 7.17 (ms) ]
[cudaDecodeGL] - [Field: 0288, 137.2 fps, frame time: 7.29 (ms) ]
[cudaDecodeGL] - [Field: 0304, 135.7 fps, frame time: 7.37 (ms) ]
[cudaDecodeGL] - [Field: 0320, 140.0 fps, frame time: 7.14 (ms) ]
[cudaDecodeGL] - [Field: 0336, 160.1 fps, frame time: 6.24 (ms) ]
[cudaDecodeGL] - [Field: 0352, 133.9 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0368, 133.6 fps, frame time: 7.49 (ms) ]
[cudaDecodeGL] - [Field: 0384, 133.8 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0400, 135.0 fps, frame time: 7.41 (ms) ]
[cudaDecodeGL] - [Field: 0416, 133.9 fps, frame time: 7.47 (ms) ]
[cudaDecodeGL] - [Field: 0432, 133.0 fps, frame time: 7.52 (ms) ]
[cudaDecodeGL] - [Field: 0448, 136.8 fps, frame time: 7.31 (ms) ]
[cudaDecodeGL] - [Field: 0464, 135.4 fps, frame time: 7.39 (ms) ]
[cudaDecodeGL] - [Field: 0480, 139.4 fps, frame time: 7.18 (ms) ]
[cudaDecodeGL] - [Field: 0496, 142.7 fps, frame time: 7.01 (ms) ]
[cudaDecodeGL] - [Field: 0512, 138.9 fps, frame time: 7.20 (ms) ]
[cudaDecodeGL] - [Field: 0528, 138.9 fps, frame time: 7.20 (ms) ]
[cudaDecodeGL] - [Field: 0544, 139.1 fps, frame time: 7.19 (ms) ]
[cudaDecodeGL] - [Field: 0560, 131.7 fps, frame time: 7.59 (ms) ]
[cudaDecodeGL] - [Field: 0576, 146.4 fps, frame time: 6.83 (ms) ]
[cudaDecodeGL] - [Field: 0592, 145.6 fps, frame time: 6.87 (ms) ]
[cudaDecodeGL] - [Field: 0608, 137.6 fps, frame time: 7.27 (ms) ]
[cudaDecodeGL] - [Field: 0624, 156.5 fps, frame time: 6.39 (ms) ]
[cudaDecodeGL] - [Field: 0640, 146.5 fps, frame time: 6.83 (ms) ]
[cudaDecodeGL] - [Field: 0656, 146.3 fps, frame time: 6.84 (ms) ]
[cudaDecodeGL] - [Field: 0672, 144.0 fps, frame time: 6.95 (ms) ]
[cudaDecodeGL] - [Field: 0688, 142.4 fps, frame time: 7.02 (ms) ]
[cudaDecodeGL] - [Field: 0704, 142.0 fps, frame time: 7.04 (ms) ]
[cudaDecodeGL] - [Field: 0720, 143.3 fps, frame time: 6.98 (ms) ]
[cudaDecodeGL] - [Field: 0736, 146.4 fps, frame time: 6.83 (ms) ]
[cudaDecodeGL] - [Field: 0752, 145.2 fps, frame time: 6.89 (ms) ]
[cudaDecodeGL] - [Field: 0768, 147.4 fps, frame time: 6.78 (ms) ]
[cudaDecodeGL] - [Field: 0784, 146.2 fps, frame time: 6.84 (ms) ]
[cudaDecodeGL] - [Field: 0800, 141.5 fps, frame time: 7.07 (ms) ]
[cudaDecodeGL] - [Field: 0816, 141.4 fps, frame time: 7.07 (ms) ]
[cudaDecodeGL] - [Field: 0832, 125.3 fps, frame time: 7.98 (ms) ]
[cudaDecodeGL] - [Field: 0848, 130.4 fps, frame time: 7.67 (ms) ]

[cudaDecodeGL] statistics
	 Video Length (hh:mm:ss.msec)   = 00:00:06.160
	 Frames Presented (inc repeats) = 856
	 Average Present Rate     (fps) = 138.94
	 Frames Decoded   (hardware)    = 1712
	 Average Rate of Decoding (fps) = 277.88

I'm getting the same performance (decoding speed measured in fps) on both cards, using both CUDA and CUVID.

Although the GTX 780 Ti has 2880 CUDA cores, a huge number compared to the 384 cores of the GT 740, I achieve the same decoding fps.

I bought the GTX 780 Ti expecting an improvement in decoding performance; the problem is that I didn't get it. Am I forgetting to configure something?

Will I get more decoding fps using a Quadro or Tesla card? Which one should I buy? Are there any specifications about this?

If it's needed, this is the output of deviceQuery:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 780 Ti"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 3071 MBytes (3220504576 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            1046 MHz (1.05 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GT 740"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2048 MBytes (2147287040 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1072 MHz (1.07 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 780 Ti (GPU0) -> GeForce GT 740 (GPU1) : No
> Peer access from GeForce GT 740 (GPU1) -> GeForce GTX 780 Ti (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 2, Device0 = GeForce GTX 780 Ti, Device1 = GeForce GT 740
Result = PASS

The forum was not behaving at all yesterday.

I quickly browsed the Quadro data sheets; 5 minutes into the exercise, I realized how lightweight most of the Quadro cards seemed in terms of CUDA cores.

Now, from the (very first page of the) CUDA Video Decoder PDF:

“The CUDA Video Decoder API gives developers access to hardware video decoding capabilities on NVIDIA GPU. The actual hardware decode can run on either Video Processor (VP) or CUDA hardware, depending on the hardware capabilities and the codecs.”

My view is that the paragraph can be read in a number of ways, but I take the 'Video Processor' to mean the dedicated block that sits beyond the frame buffer.
And when you hardly see a difference when specifying different settings, I take it that decoding is done via the VP rather than the CUDA cores (or there is some other overhead or peripheral function whose consumption is registered under the VP).
The quoted paragraph may imply that you may or may not have success attempting to shift decoding from the VP to CUDA.

Hence, the preferred device seems to pivot on the reason why decoding is done via the VP, and whether it can be shifted to CUDA.
Also relevant is the bottleneck the application currently faces, with decoding presumably via the VP.
What exactly falls under the VP is ambiguous too, in my view.

What do you do with the decoded video? Do you render it, or not?
And what is the implied memory bandwidth of the pre-/post-decode video?
I am wondering whether VP utilization is reported at 90% merely due to decoding, or perhaps also due to rendering and/or memory load.
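
A rough check on the numbers, in case it helps: a 1920x1080 NV12 frame is 1920 x 1080 x 1.5 bytes ≈ 3.1 MB, so even at the ~277 fps your benchmark reports, the decoder writes out less than 1 GB/s - a small fraction of either card's memory bandwidth, which would suggest memory load alone does not explain the 90%. Also, if the sample's decode counter is per field, 4 interlaced streams at 29.97 fps amount to ~240 field decodes per second, which is about 87% of the ~277/s the sample measures - rather consistent with the Video Engine figure you quoted.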

The testing code is the sample included in the CUDA SDK; it's under the samples/3_Imaging/cudaDecodeGL folder (CUDA Samples).

$ ./cudaDecodeGL --help
[CUDA/OpenGL Video Decode]
Command Line Arguments:
argv[0] = ./cudaDecodeGL
argv[1] = --help

CUDA/OpenGL Video Decode - Help

  cudaDecodeGL [parameters] [video_file]

Program parameters:
        -decodecuda   - Use CUDA for MPEG-2 (Available with 64+ CUDA cores)
        -decodedxva   - Use VP for MPEG-2, VC-1, H.264 decode.
        -decodecuvid  - Use VP for MPEG-2, VC-1, H.264 decode (optimized)
        -vsync        - Enable vertical sync.
        -novsync      - Disable vertical sync.
        -repeatframe  - Enable frame repeats.
        -updateall    - always update CSC matrices.
        -displayvideo - display video frames on the window
        -nointerop    - create the CUDA context w/o using graphics interop
        -readback     - enable readback of frames to system memory
        -device=n     - choose a specific GPU device to decode video with

I'm using the Average Rate of Decoding (fps) to compare the different configurations and cards.
Using two videos (each encoded in both MPEG-2 and H.264), I'm getting these results (if the full console output is needed, I can upload it):

GTX 780 Ti

  • HD (1920x1080) H264 -decodecuda: 273.92
  • HD (1920x1080) H264 -decodecuvid: 277.29
  • HD (1920x1080) MPEG2 -decodecuda: 1082.88
  • HD (1920x1080) MPEG2 -decodecuvid: 1075.57
  • SD (720x480) H264 -decodecuda: 1361.01
  • SD (720x480) H264 -decodecuvid: 1361.06
  • SD (720x480) MPEG2 -decodecuda: 6998.82
  • SD (720x480) MPEG2 -decodecuvid: 7116.12

GT 740

  • HD (1920x1080) H264 -decodecuda: 266.88
  • HD (1920x1080) H264 -decodecuvid: 277.81
  • HD (1920x1080) MPEG2 -decodecuda: 458.34
  • HD (1920x1080) MPEG2 -decodecuvid: 459.01
  • SD (720x480) H264 -decodecuda: 1085.41
  • SD (720x480) H264 -decodecuvid: 1085.81
  • SD (720x480) MPEG2 -decodecuda: 2586.19
  • SD (720x480) MPEG2 -decodecuvid: 2584.54

The cudaDecodeGL command is always invoked with the -nointerop flag.

According to the gathered information, both cards have similar H.264 decoding performance, but the 780 has better MPEG-2 decoding performance. Neither preferCUDA nor preferCUVID gives any significant performance difference.

I've read the NVIDIA PDF about NVCUVID. But, empirically, I can't find any difference in performance between preferring the VP (video processor) and CUDA. I don't know whether this option is not implemented on Linux, and I'm therefore unable to set it as you've said, or whether there is another problem. Is there something I'm missing?

In addition, how should I read a card's specifications in order to make some prediction about its decoding performance (primarily H.264 decoding)? The number of CUDA cores is not a valid indicator of decoding performance. To which part of the specification should I pay more attention?

Thanks!

Information about the video files:

$ ffmpeg -i hd.mp4 
ffmpeg version 1.0.6_patch-aac-resample-lock Copyright (c) 2000-2013 the FFmpeg developers
  built on Jun  2 2015 17:00:42 with gcc 4.8.3 (GCC) 20140911 (Red Hat 4.8.3-9)
  configuration: --enable-libvpx --enable-shared --prefix=/usr --enable-libtheora --enable-postproc --enable-gpl --enable-libmp3lame --enable-libvorbis --enable-libx264 --enable-libfdk_aac --enable-nonfree --libdir=/usr/lib64 --shlibdir=/usr/lib64
  libavutil      51. 73.101 / 51. 73.101
  libavcodec     54. 59.100 / 54. 59.100
  libavformat    54. 29.104 / 54. 29.104
  libavdevice    54.  2.101 / 54.  2.101
  libavfilter     3. 17.100 /  3. 17.100
  libswscale      2.  1.101 /  2.  1.101
  libswresample   0. 15.100 /  0. 15.100
  libpostproc    52.  0.100 / 52.  0.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x20eb240] multiple edit list entries, a/v desync might occur, patch welcome
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'hd.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf54.29.104
  Duration: 00:00:28.66, start: 1.533000, bitrate: 13610 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 13607 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 59.94 tbc
    Metadata:
      handler_name    : VideoHandler

$ ffmpeg -i hd.m2v
ffmpeg version 1.0.6_patch-aac-resample-lock Copyright (c) 2000-2013 the FFmpeg developers
  built on Jun  2 2015 17:00:42 with gcc 4.8.3 (GCC) 20140911 (Red Hat 4.8.3-9)
  configuration: --enable-libvpx --enable-shared --prefix=/usr --enable-libtheora --enable-postproc --enable-gpl --enable-libmp3lame --enable-libvorbis --enable-libx264 --enable-libfdk_aac --enable-nonfree --libdir=/usr/lib64 --shlibdir=/usr/lib64
  libavutil      51. 73.101 / 51. 73.101
  libavcodec     54. 59.100 / 54. 59.100
  libavformat    54. 29.104 / 54. 29.104
  libavdevice    54.  2.101 / 54.  2.101
  libavfilter     3. 17.100 /  3. 17.100
  libswscale      2.  1.101 /  2.  1.101
  libswresample   0. 15.100 /  0. 15.100
  libpostproc    52.  0.100 / 52.  0.100
[mpegvideo @ 0x8e0240] max_analyze_duration 5000000 reached at 5005000
[mpegvideo @ 0x8e0240] Estimating duration from bitrate, this may be inaccurate
Input #0, mpegvideo, from 'hd.m2v':
  Duration: 00:00:01.80, bitrate: 104857 kb/s
    Stream #0:0: Video: mpeg2video (Main), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 104857 kb/s, 29.97 fps, 29.97 tbr, 1200k tbn, 59.94 tbc

$ ffmpeg -i sd.mp4 
ffmpeg version 1.0.6_patch-aac-resample-lock Copyright (c) 2000-2013 the FFmpeg developers
  built on Jun  2 2015 17:00:42 with gcc 4.8.3 (GCC) 20140911 (Red Hat 4.8.3-9)
  configuration: --enable-libvpx --enable-shared --prefix=/usr --enable-libtheora --enable-postproc --enable-gpl --enable-libmp3lame --enable-libvorbis --enable-libx264 --enable-libfdk_aac --enable-nonfree --libdir=/usr/lib64 --shlibdir=/usr/lib64
  libavutil      51. 73.101 / 51. 73.101
  libavcodec     54. 59.100 / 54. 59.100
  libavformat    54. 29.104 / 54. 29.104
  libavdevice    54.  2.101 / 54.  2.101
  libavfilter     3. 17.100 /  3. 17.100
  libswscale      2.  1.101 /  2.  1.101
  libswresample   0. 15.100 /  0. 15.100
  libpostproc    52.  0.100 / 52.  0.100
[h264 @ 0x1bc8b00] mmco: unref short failure
    Last message repeated 7 times
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'sd.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf54.29.104
  Duration: 00:02:28.92, start: 0.000000, bitrate: 1324 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 720x480 [SAR 10:11 DAR 15:11], 1319 kb/s, 47.90 fps, 59.94 tbr, 90k tbn, 59.94 tbc
    Metadata:
      handler_name    : VideoHandler

$ ffmpeg -i sd.m2v 
ffmpeg version 1.0.6_patch-aac-resample-lock Copyright (c) 2000-2013 the FFmpeg developers
  built on Jun  2 2015 17:00:42 with gcc 4.8.3 (GCC) 20140911 (Red Hat 4.8.3-9)
  configuration: --enable-libvpx --enable-shared --prefix=/usr --enable-libtheora --enable-postproc --enable-gpl --enable-libmp3lame --enable-libvorbis --enable-libx264 --enable-libfdk_aac --enable-nonfree --libdir=/usr/lib64 --shlibdir=/usr/lib64
  libavutil      51. 73.101 / 51. 73.101
  libavcodec     54. 59.100 / 54. 59.100
  libavformat    54. 29.104 / 54. 29.104
  libavdevice    54.  2.101 / 54.  2.101
  libavfilter     3. 17.100 /  3. 17.100
  libswscale      2.  1.101 /  2.  1.101
  libswresample   0. 15.100 /  0. 15.100
  libpostproc    52.  0.100 / 52.  0.100
[mpegvideo @ 0x2205240] max_analyze_duration 5000000 reached at 5005000
[mpegvideo @ 0x2205240] Estimating duration from bitrate, this may be inaccurate
Input #0, mpegvideo, from 'sd.m2v':
  Duration: 00:00:01.78, bitrate: 104857 kb/s
    Stream #0:0: Video: mpeg2video (Main), yuv420p, 720x480 [SAR 8:9 DAR 4:3], 104857 kb/s, 59.94 fps, 59.94 tbr, 1200k tbn, 119.88 tbc

“According to the gathered information, both cards have similar H.264 decoding performance, but the 780 has better MPEG-2 decoding performance. Neither preferCUDA nor preferCUVID gives any significant performance difference.”

“But, empirically, I can’t find any difference in performance between preferring the VP (video processor) and CUDA.”

My thinking is:
a) given your (significant) decoding load, multiple devices are a distinct design possibility;
b) if you can/could switch from 'hard' decoding (VP) to 'soft' decoding (CUDA), you may favour devices with plenty of CUDA cores, and in the process you might get the total device count down.
Otherwise, you may find yourself forced to consider a (much larger) number of lightweight devices, as I doubt considerable VP muscle is packed into any single device, implying that a device's VP can easily be overloaded.
Perhaps I am wrong, but my view is that CUDA cores generally scale across devices in relation to the target application; on the other hand, one may find VP power generally (a lot more) constant across devices;
c) point b) is also dependent on what you do with the decoded video; moving decoding from the VP to the CUDA cores may still see the VP significantly loaded, depending on the load and what is done with the video in the end.

“I don’t know whether this option is not implemented on Linux”

The CUDA Video Decoder implementation seems 'inferior' on Linux compared to Windows; if you search for 'linux' or 'windows' in the CUDA Video Decoder PDF, you will soon realize this.
Thus, in this case, your OS and format may be of importance.
I get the idea that it might be easier to push MPEG-2 into soft decoding than H.264, particularly on Linux.

On second thought, if I am not mistaken, some of the devices, in particular Quadros, can drive multiple displays.
One would then expect those devices to have proportionally stronger VPs.

I understand that, but I'm trying to measure the processing power of each card.

How can I achieve that switch? I've been testing with the CUDA sample (3_Imaging/cudaDecodeGL), and the command-line flag (-decodecuda / -decodecuvid) meant to make that switch isn't giving me any performance difference.

Right now I'm only measuring the decoding performance of the different cards; I'm not doing anything with the video after the decoding process.

I'm trying to understand how to know which card will be the best for this job. NVIDIA cards have very different prices. My budget doesn't allow me to buy several cards and test which has the best decoding performance. I've never tested a Quadro or Tesla card; these cards aren't cheap. I don't want to repeat the mistake of buying a high-priced card that has approximately the same decoding performance as a cheaper one, e.g. GT 740 vs GTX 780 Ti.

Is there a way to know which card will have the highest H.264 decoding fps?

“I understand that”

I understood that you understood the point; I merely reiterated for completeness of the train of thought.

“How can I achieve that switch?”

Yes, that is why I said 'can/could'.
The medium of decoding (soft vs hard) seems to be OS- and codec-dependent.
Hence, a possibility to consider is temporarily using Windows as a test bed and trial, and also evaluating the outcome when a different format is used (MPEG-2 vs H.264).

“Right now I’m only measuring the decoding performance of the different cards; I’m not doing anything with the video after the decoding process.”

noted

“I’m trying to understand how to know which card will be the best for this job.”

…and clearly it is not a trivial task; perhaps due to a lack of proper documentation…?

“NVIDIA cards have very different prices.”

quite

“My budget doesn’t allow me to buy several cards and test which has the best decoding performance.”

understood

If your budget permits, perhaps you can consider a low-end Quadro that can drive multiple displays; I did spot such a card.
The other alternative is to attempt to find, in the literature and other sources, documentation of VPs, their bandwidth and their utility, particularly in terms of multiple displays.
Perhaps an NVIDIA consultant/sales representative can assist in this regard.

See this document:

Video capture, encoding and streaming in a multi-GPU system

http://www.nvidia.com/docs/IO/40049/TB-Quadro_VideoCaptureStreaming_v01.pdf

It already moves in the direction of a bandwidth (throughput) measurement.

The assumption thus far is that the VPs of GPUs that can drive multiple displays can be 'parallel-exposed', much like SMs are parallel-exposed.
I do not know whether the assumption holds; if VPs are documented (in detail), then I have missed that in-depth discussion of VPs.

To some extent, I think that running a trial case on Windows is your best option right now.

Hello nickcis,

I have an off-topic question for you as a Kepler video card owner.
I'm currently working on an NVCUVID-enabled project and am facing quite an interesting issue: MPEG-2 decoding gives me garbage instead of a valid decoding result. Output example: http://barbashov.pro/mpeg2.mp4
At some point I noticed that on Maxwell video cards (specifically the GTX 980 Ti) MPEG-2 decoding under Linux doesn't work at all, while on Kepler video cards (GTX 680 and GT 730) MPEG-2 decoding works well with my code but doesn't work correctly with the official cudaDecodeGL example.
So, my question is: have you ever actually taken a look at the cudaDecodeGL output? I mean, if you run it with the -displayvideo command-line argument, what do you see?

Thanks in advance!