CUDA 5.0 (Decode video using NVCUVID) and Performance

Hey guys,

I decided to test decoding of 720p MPEG-2 video on a variety of graphics devices. I have four devices:

  • Nvidia GeForce 8600 GT (32 cores)
  • Nvidia Tesla C1060 (240 cores)
  • Nvidia GeForce GTX 680 (1536 cores)
  • Nvidia GeForce 560M (192 cores)

I ran the SDK sample cudaDecodeGL.exe with the -nointerop parameter and the sample input file plush1_720p_10s.m2v.

Result [average decode rate, fps] (the core counts above can be checked with the short program after these results):
Nvidia GeForce 8600 GT (32 cores): ~105 fps
Nvidia Tesla C1060 (240 cores): ~550 fps
Nvidia GeForce GTX 680 (1536 cores): ~800 fps
Nvidia GeForce 560M (192 cores): ~460 fps
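
For reference, here is a minimal sketch (my own code, not from the SDK) that prints the CUDA core count per device the way deviceQuery does, assuming the usual cores-per-SM figures for each compute capability:

#include <stdio.h>
#include <cuda_runtime.h>

/* Assumed cores per multiprocessor by compute capability:
   8 for sm_1x, 32 for sm_2.0, 48 for sm_2.1, 192 for sm_3x (Kepler). */
static int CoresPerSM(int major, int minor)
{
    if (major == 1) return 8;
    if (major == 2) return (minor == 0) ? 32 : 48;
    if (major == 3) return 192;
    return 0; /* unknown generation */
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        int cores = prop.multiProcessorCount * CoresPerSM(prop.major, prop.minor);
        printf("Device %d: %s, SM %d.%d, %d MPs -> %d CUDA cores\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, cores);
    }
    return 0;
}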

The Nvidia GeForce GTX 680 is based on the Kepler architecture (GK104) and has 1536 cores on board, 48 times more than the 8600 GT and about 6 times more than the Tesla C1060. So why are the results so uneven? Shouldn't we expect a good performance gain in video decoding?

With other formats such as H.264 the situation is the same: Kepler's performance is surprisingly low. Why?

Thanks!

Decoding is done by a dedicated hard-wired circuit, not by the 3D rendering pipeline, so only the GPU generation counts, not the number of cores.

Let me give an example.
Studying the API, you can specify the type of decoding:

typedef enum cudaVideoCreateFlags_enum {
     cudaVideoCreate_Default = 0x00, // Default operation mode: use dedicated video engines
     cudaVideoCreate_PreferCUDA = 0x01, // Use a CUDA-based decoder if faster than dedicated engines (requires a valid vidLock object for multi-threading)
     cudaVideoCreate_PreferDXVA = 0x02, // Go through DXVA internally if possible (requires D3D9 interop)
     cudaVideoCreate_PreferCUVID = 0x04 // Use dedicated video engines directly
} cudaVideoCreateFlags;
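
For context, this flag ends up in the ulCreationFlags field of CUVIDDECODECREATEINFO, which is then passed to cuvidCreateDecoder(). A minimal sketch for the 720p MPEG-2 clip used here (the surface counts and the ctx-lock handling are my assumptions, not code copied from the sample):

#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <cuviddec.h>   /* shipped with the sample as dynlink_cuviddec.h in some SDK versions */

/* Create a decoder for a 1280x720 MPEG-2 stream and force the dedicated
   video engine via cudaVideoCreate_PreferCUVID. */
static CUvideodecoder CreateMpeg2Decoder(CUvideoctxlock vidLock)
{
    CUVIDDECODECREATEINFO dci;
    memset(&dci, 0, sizeof(dci));

    dci.ulWidth             = 1280;
    dci.ulHeight            = 720;
    dci.ulTargetWidth       = 1280;
    dci.ulTargetHeight      = 720;
    dci.CodecType           = cudaVideoCodec_MPEG2;
    dci.ChromaFormat        = cudaVideoChromaFormat_420;
    dci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
    dci.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
    dci.ulNumDecodeSurfaces = 8;         /* assumption: enough surfaces for MPEG-2 reordering */
    dci.ulNumOutputSurfaces = 2;
    dci.vidLock             = vidLock;   /* needed for cudaVideoCreate_PreferCUDA with multiple threads */

    /* This is where the decode path is chosen: */
    dci.ulCreationFlags     = cudaVideoCreate_PreferCUVID;

    CUvideodecoder decoder = NULL;
    if (cuvidCreateDecoder(&decoder, &dci) != CUDA_SUCCESS) {
        fprintf(stderr, "cuvidCreateDecoder failed\n");
        return NULL;
    }
    return decoder;
}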

In the sample this is selected with the following command-line parameters:

printf (" t-decodecuda - Use CUDA for MPEG-2 (Available with 64 + CUDA cores)  n");
printf (" t-decodedxva - Use VP for MPEG-2, VC-1, H.264 decode.  n");
printf (" t-decodecuvid - Use VP for MPEG-2, VC-1, H.264 decode (optimized)  n");

Let's review the results for the two modes of operation, software (-decodecuda) and hardware (-decodecuvid) decoding:
[Call: cudaDecodeGL.exe -nointerop -decodecuda/-decodecuvid plush1_720p_10s.m2v]
Device: GeForce 8600 GT -decodecuda: 105 fps -decodecuvid: 140 fps
Device: Tesla C1060 -decodecuda: 565 fps -decodecuvid: 140 fps
Device: GeForce GTX 460 -decodecuda: 715 fps -decodecuvid: 210 fps
Device: GeForce GTX 680 -decodecuda: 810 fps -decodecuvid: 370 fps

As you can see, on the GeForce 8600 GT software (CUDA) decoding loses to hardware decoding, since the device has fewer than 64 cores. That is logical.
The more cores a device has, the more CUDA decoding wins and hardware decoding loses. That is also logical.

But I still do not understand why Kepler (GeForce GTX 680) gives such a modest result for CUDA decoding, given its core count …