Is 260 MB of GPU memory usage normal for one GPU H.264 video decoder?

Dear NVIDIA experts,

We have implemented GPU hardware-accelerated H.264 video decoding via the NVCUVID API on NVIDIA P4 (and GTX 1080) cards, which use the Pascal architecture.

This implementation is essentially a copy of the code from the $CUDASAMPLE\v8.0\3_Imaging\cudaDecodeD3D9 project, except that we replaced the videoSource part with our own video demuxer, which is responsible for extracting the H.264 NALU bitstream from the video file.

Now we want to measure the performance of this GPU hardware H.264 decoder. To do this, we ran two experiments:

– Experiment 1.
Decode one video file with a resolution of 1920x1080 using a single decoder. The benchmark data are:
- decoder's fps: 650 fps ~ 850 fps (speed varies with the video source)
- GPU memory used by this decoder: 260 MB

– Experiment 2.
Decode 4 video files with 4 decoders running concurrently, each decoder handling one file. All 4 videos are copies of the same 1920x1080 H.264 file. The benchmark data are:
- summed fps of the 4 decoders: 650 fps ~ 850 fps (speed varies with the video source)
- GPU memory used by these 4 decoders: (260 MB x 4)

We have read the NVIDIA Video Codec SDK Application Notes and found that the speed we measured is roughly in line with the speed mentioned there, but neither those notes nor any official NVIDIA webpage gives any data on GPU memory usage.

Based on our experiments, each video decoder instance appears to take up ~260 MB of GPU memory for 1920x1080 video. Frankly this is much higher than we expected: in our experience an H.264 video decoder should not need more than 50 MB for 1920x1080 video in most cases.

We had planned to use one GPU card to decode 20 channels of live HD H.264 streams concurrently via 20 decoder instances, while also running some other image-processing tasks, but memory now looks like a major obstacle, since each decoder takes up so much of it.

Now we just want to know whether the 260 MB of GPU memory usage we measured for one GPU H.264 HD video decoder is normal. Is there any reference information on memory usage for GPU-accelerated video decoders, especially H.264 baseline- and main-profile decoders?

Any information is appreciated!

br,
zxjan

A bit tangential: we have the same issue on the encode side. Every encode context for a 1-megapixel image takes 100 MB of VRAM, severely limiting the number of concurrent encodes (on mid-to-high-end Quadro cards, mostly Kepler or Maxwell). The 4:2:0-sampled YUV is only 1.5 MB, so even allowing for lots of internal buffers, 100 MB seems exorbitant. I have not yet looked in detail at whether the input buffer or the encode context itself causes the memory jump.

NathanKidd, thank you for the information!

In my case, the video to decode is 1920x1080, 30 fps, YUV420 pixel format, encoded as H.264 baseline profile; per our experience it should not need much memory.
Besides, our CUDA decoder implementation is almost a copy of the CUDA samples, and the FrameQueue acting as the display frame buffer holds at most 20 frames, so 260 MB of GPU memory for a single video decoder is exorbitant.

Hope NVIDIA experts could see this post.

Dear NVIDIA experts,

Any suggestions?
Any information is appreciated!

Hi

260 MB of memory consumption seems too high and is not expected for a full-HD stream. Here are a few questions/suggestions to help root-cause the problem:

  1. You seem to be referring to samples from the CUDA SDK. We advise you to use the samples included in https://developer.nvidia.com/nvidia-video-codec-sdk; the samples from the CUDA SDK are outdated.

  2. Can you please let us know the Driver version and SDK (CUDA SDK and Video SDK) version that you are using?

  3. Can you specify the values that you are using for “ulNumDecodeSurfaces”, “ulNumOutputSurfaces” and “DeinterlaceMode” in the structure CUVIDDECODECREATEINFO?

Thanks

Dear vignesh,

Thank you for your response!

1)
You seem to be referring to samples from CUDA SDK. We advise you to refer to the samples included in https://developer.nvidia.com/nvidia-video-codec-sdk. The samples from CUDA SDK are outdated.

===>

 Yes, we used the samples from the CUDA SDK, version <b>8.0.44</b>.
We installed this CUDA SDK via the cuda_8.0.44_windows.exe installer, downloaded from the official NVIDIA website.
After installing, we used the sample located at C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\3_Imaging\cudaDecodeD3D9\cudaDecodeD3D9_vs2010.sln.

 <b>However, since the Video Codec SDK you recommended (https://developer.nvidia.com/nvidia-video-codec-sdk) has version 8.0.14, which looked older than the 8.0.44 we used, we didn't try it. We kept using the cudaDecodeD3D9 sample from CUDA 8.0.44.</b>
  1. The whole measuring procedure is as follows:
  • a) We compiled C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\3_Imaging\cudaDecodeD3D9\cudaDecodeD3D9_vs2010.sln with the x64 Debug configuration.
  • b) We then ran "cudaDecodeD3D9.exe" with the arguments "-displayvideo plush1_720p_10s.m2v". plush1_720p_10s.m2v is the test video shipped with the cudaDecodeD3D9 sample itself; it has a resolution of 1280x720 and a duration of 7 seconds.
  • c) While it ran, we observed GPU memory usage with the GPU-Z tool (version 1.18.0) and recorded: before running, GPU memory used was 269 MB; during running, max GPU memory used was 467 MB; conclusion: GPU memory consumption is 467-269 = 197 MB for 1280x720 video.
  • d) We then switched to another video, with a resolution of 1920x1080, and recorded: before running, GPU memory used was 285 MB; during running, max GPU memory used was 531 MB; conclusion: GPU memory consumption is 531-285 = 246 MB for 1920x1080 video.

What’s wrong in our experiments?

Any information is appreciated!

zxjan

Dear vignesh,

Additional info again!

For the structure CUVIDDECODECREATEINFO, we initialized it like below:

------------------------------------------------------------------------

memset(&oVideoDecodeCreateInfo_, 0, sizeof(CUVIDDECODECREATEINFO));

oVideoDecodeCreateInfo_.CodecType           = rVideoFormat.codec;
oVideoDecodeCreateInfo_.ulWidth             = rVideoFormat.coded_width;
oVideoDecodeCreateInfo_.ulHeight            = rVideoFormat.coded_height;

oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = 20;

oVideoDecodeCreateInfo_.ChromaFormat        = cudaVideoChromaFormat_420; 
oVideoDecodeCreateInfo_.OutputFormat        = cudaVideoSurfaceFormat_NV12;

oVideoDecodeCreateInfo_.DeinterlaceMode     = cudaVideoDeinterlaceMode_Adaptive;

oVideoDecodeCreateInfo_.ulTargetWidth       = rVideoFormat.display_area.right - rVideoFormat.display_area.left;
oVideoDecodeCreateInfo_.ulTargetHeight      = rVideoFormat.display_area.bottom - rVideoFormat.display_area.top;

oVideoDecodeCreateInfo_.ulNumOutputSurfaces = 2;
oVideoDecodeCreateInfo_.ulCreationFlags     = cudaVideoCreate_PreferCUDA;
oVideoDecodeCreateInfo_.vidLock             = vidCtxLock;
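For reference, a hypothetical lower-memory variant of these fields (our guess, not validated; assumes progressive content and the nvcuvid.h header) would be:

```cpp
// Hypothetical lower-memory settings (illustrative, not validated):
// fewer decode surfaces, a single output surface, and Weave
// deinterlacing, which needs no extra post-processing buffers and is
// fine for progressive streams.
oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = 8;  // must still cover the stream's DPB size
oVideoDecodeCreateInfo_.ulNumOutputSurfaces = 1;  // map one frame at a time
oVideoDecodeCreateInfo_.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
```

We do not know how much of the 260 MB these fields actually account for; that is part of the question.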


If any further info is needed, please let us know.

zxjan

Dear vignesh,

Are you here?

Could you give us a hand?

zxjan

We have the same issue and almost the same VRAM usage per stream.
We are using hardware-accelerated FFmpeg as the backend for OpenCV, and use OpenCV to capture RTSP video.
It seems that each captured video stream consumes about 250 MB of GPU RAM.
We have only tested full-HD streams on a 1080 Ti.
This issue severely limits how far our decoding scales: we can only decode about ~40 streams concurrently on a 1080 Ti, which has about 11 GB.

As zxjan said, if decoding a full-HD stream should consume only about 50 MB, we could scale up almost 5x, which means decoding about ~200 streams concurrently.

Please help us with this issue.

Lert

I have the same issue.
Are there any updates?

Hi.

NVDEC_VideoDecoder_API_ProgGuide.pdf in NVIDIA Video Codec SDK 9.1 contains a section,
4.8 WRITING AN EFFICIENT DECODE APPLICATION

This contains some hints on writing an application with optimized video memory usage.
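For example, one memory-saving pattern from that section is to size the decode surface pool from the parser's sequence callback instead of hard-coding a count. A rough sketch (assumes Video Codec SDK 8.2+ headers; illustrative, not a drop-in implementation):

```cpp
// Sketch: allocate only as many decode surfaces as the stream needs.
// CUVIDEOFORMAT::min_num_decode_surfaces is reported by the parser
// (SDK 8.2 and later); returning it from the sequence callback tells
// the parser to run with that pool size.
static int CUDAAPI HandleVideoSequence(void* pUserData, CUVIDEOFORMAT* pFormat)
{
    CUVIDDECODECREATEINFO ci = {};
    ci.CodecType           = pFormat->codec;
    ci.ulWidth             = pFormat->coded_width;
    ci.ulHeight            = pFormat->coded_height;
    ci.ulNumDecodeSurfaces = pFormat->min_num_decode_surfaces; // not a fixed 20
    ci.ulNumOutputSurfaces = 1;                                // map one at a time
    ci.ChromaFormat        = pFormat->chroma_format;
    ci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
    ci.DeinterlaceMode     = pFormat->progressive_sequence
                               ? cudaVideoDeinterlaceMode_Weave
                               : cudaVideoDeinterlaceMode_Adaptive;
    // ... create the decoder with cuvidCreateDecoder(&hDecoder, &ci) ...
    return pFormat->min_num_decode_surfaces;  // >1 overrides the surface count
}
```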
Let us know if you find that useful and have any further questions.

Thanks.