CPU usage does not go down (using CUDA decode)

tegra_multimedia_api\samples\02_video_dec_cuda

I use this sample to receive RTSP data and then decode H.264 with CUDA, but the CPU usage does not go down.

The CPU usage is the same as with FFmpeg 2.7.4 decoding H.264 on the CPU.

tegrastats:
RAM 894/3994MB (lfb 437x4MB) cpu [33%,15%,6%,31%]@1734 GR3D 0%@76 EDP limit 0

tx1 info:

TVMR: cbBeginSequence: 1166: BeginSequence 1920x1088, bVPR = 0, fFrameRate = 25.000000
TVMR: LowCorner Frequency = 0
TVMR: cbBeginSequence: 1545: DecodeBuffers = 2, pnvsi->eCodec = 4, codec = 0
TVMR: cbBeginSequence: 1606: Display Resolution : (1920x1080)
TVMR: cbBeginSequence: 1607: Display Aspect Ratio : (1920x1080)
TVMR: cbBeginSequence: 1649: ColorFormat : 5
TVMR: cbBeginSequence:1660 ColorSpace = NvColorSpace_YCbCr709
TVMR: cbBeginSequence: 1790: SurfaceLayout = 3
TVMR: cbBeginSequence: 1868: NumOfSurfaces = 3, InteraceStream = 0, InterlaceEnabled = 0, bSecure = 0, MVC = 0 Semiplanar = 1, bReinit = 1, BitDepthForSurface = 8 LumaBitDepth = 8, ChromaBitDepth = 8, ChromaFormat = 5
ev.type == V4L2_EVENT_RESOLUTION_CHANGE
Video Resolution: 1920x1080
libv4l2_nvvidconv (0):(765) (INFO) : Allocating (8) OUTPUT PLANE BUFFERS Layout=1
libv4l2_nvvidconv (0):(775) (INFO) : Allocating (8) CAPTURE PLANE BUFFERS Layout=0
Query and set capture successful
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000

CudaDecode.h (2.94 KB)
CudaDecode.cpp (23.5 KB)

Hi SunYe,
Please share the detailed steps so that we can check further. Thanks.

Please see the attachment.

Usage:

CudaDecode cudaDecoder[MAX_VIDEO_CHAN];
cudaDecoder[i].SetChan(i);
cudaDecoder[i].Open();
cudaDecoder[iChan].DecodeFrame(packet->data, packet->size);
cudaDecoder[i].Close();

Steps:
1. Receive RTSP H.264 video data from the network
2. Decode it with CUDA

Hi SunYe,
The following APIs
CudaDecode cudaDecoder[MAX_VIDEO_CHAN];
cudaDecoder[i].SetChan(i);
cudaDecoder[i].Open();
cudaDecoder[iChan].DecodeFrame(packet->data, packet->size);
cudaDecoder[i].Close();

are not in
02_video_dec_cuda

Which sample are you referring to, please?

The CudaDecode class is my own modification of 02_video_dec_cuda. These APIs are mine, and the code is based on 02_video_dec_cuda.
See the code attached to the first post:

CudaDecode.h
CudaDecode.cpp

Hi SunYe,
The following is what we observe with 02_video_dec_cuda:

RAM 1101/3995MB (lfb 1x4MB) cpu [9%,4%,4%,0%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 0%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [5%,9%,2%,0%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 0%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [6%,2%,3%,9%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 4%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [7%,2%,0%,10%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 716 MSENC 716 GR3D 8%@998 EDP limit 1734

The video file is bourne_ultimatum_trailer.zip (The Bourne Ultimatum, High Definition (1080p) Theatrical Trailer, from dvdloc8.com).
We extract the H.264 stream and run:

./video_dec_cuda /home/ubuntu/Bourne_Trailer.h264 H264

My program using CUDA decode (RTSP + CUDA):

RAM 887/3994MB (lfb 88x4MB) cpu [28%,24%,17%,21%]@518 EMC 12%@204 AVP 84%@14 GR3D 7%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [28%,18%,12%,8%]@102 EMC 6%@408 AVP 83%@17 GR3D 0%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [27%,20%,16%,12%]@307 EMC 1%@1600 AVP 83%@14 GR3D 5%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [33%,24%,18%,5%]@102 EMC 39%@68 AVP 83%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [34%,27%,25%,8%]@710 EMC 26%@102 AVP 82%@14 GR3D 0%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [24%,10%,18%,6%]@102 EMC 39%@68 AVP 84%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [26%,26%,18%,10%]@307 EMC 1%@1600 AVP 83%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [16%,21%,25%,9%]@102 EMC 13%@204 AVP 11%@115 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [28%,15%,24%,12%]@204 EMC 1%@1600 AVP 83%@14 GR3D 0%@76 EDP limit 1734

Without any decoding:
RAM 878/3994MB (lfb 92x4MB) cpu [24%,2%,10%,7%]@518 EMC 22%@40 AVP 83%@19 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [23%,0%,2%,19%]@102 EMC 22%@40 AVP 85%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [21%,5%,6%,6%]@307 EMC 13%@68 AVP 85%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [8%,2%,20%,2%]@102 EMC 2%@408 AVP 82%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [29%,0%,12%,1%]@204 EMC 13%@68 AVP 85%@15 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [26%,3%,12%,4%]@102 EMC 22%@40 AVP 82%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [25%,0%,5%,13%]@204 EMC 13%@68 AVP 84%@17 GR3D 0%@76 EDP limit 1734

Using FFmpeg decode (RTSP + FFmpeg):

RAM 884/3994MB (lfb 91x4MB) cpu [15%,30%,3%,37%]@1734 EMC 1%@1600 AVP 83%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [7%,3%,0%,62%]@307 EMC 42%@68 AVP 83%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [11%,1%,1%,70%]@204 EMC 7%@408 AVP 69%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [21%,8%,24%,26%]@204 EMC 42%@68 AVP 78%@12 GR3D 0%@76 EDP limit 1734
RAM 884/3994MB (lfb 91x4MB) cpu [32%,2%,1%,47%]@1224 EMC 1%@1600 AVP 85%@13 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [25%,32%,2%,20%]@1734 EMC 1%@1600 AVP 82%@13 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [20%,7%,54%,1%]@1734 EMC 1%@1600 AVP 84%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [8%,42%,16%,4%]@1734 EMC 1%@1600 AVP 82%@13 GR3D 0%@76 EDP limit 1734
RAM 884/3994MB (lfb 91x4MB) cpu [17%,3%,50%,5%]@307 EMC 42%@68 AVP 85%@14 GR3D 0%@76 EDP limit 1734

On an Intel i7-6700 with a GTX 960 video card (Windows 7),

CPU usage with CUDA decode is clearly lower than with FFmpeg decode,
but on the TX1 the situation is different. Why?
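The tegrastats captures above can be compared numerically rather than by eye. The following is a minimal sketch, not part of the original sample: `CpuSample`, `parse_cpu_field`, `mean_load_pct` and `load_at_clock_pct` are hypothetical helper names. It averages the per-core percentages from the `cpu [...]@MHz` field and, as a rough approximation only, rescales the load to a reference clock, since a utilization figure taken at 102 MHz is not directly comparable to one taken at 1734 MHz.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parsed "cpu [..]@MHz" field from a tegrastats line (hypothetical type).
struct CpuSample {
    std::vector<int> core_pct;  // per-core load in percent
    int freq_mhz = 0;           // CPU clock reported after '@'
};

// Parses e.g. "... cpu [28%,24%,17%,21%]@518 EMC ..." into `out`.
// Returns false if the field is missing or does not match that layout.
bool parse_cpu_field(const std::string& line, CpuSample& out) {
    const std::string key = "cpu [";
    std::size_t pos = line.find(key);
    if (pos == std::string::npos) return false;
    pos += key.size();
    const std::size_t end = line.find(']', pos);
    if (end == std::string::npos) return false;

    CpuSample s;
    int value = 0;
    bool have_digit = false;
    for (std::size_t i = pos; i < end; ++i) {
        const char c = line[i];
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            have_digit = true;
        } else if (c == '%' && have_digit) {
            s.core_pct.push_back(value);
            value = 0;
            have_digit = false;
        } else if (c != ',') {
            return false;  // unexpected character inside the brackets
        }
    }
    if (s.core_pct.empty()) return false;

    std::size_t i = end + 1;
    if (i >= line.size() || line[i] != '@') return false;
    for (++i; i < line.size() && line[i] >= '0' && line[i] <= '9'; ++i) {
        s.freq_mhz = s.freq_mhz * 10 + (line[i] - '0');
    }
    out = s;
    return true;
}

// Plain average of the per-core percentages.
double mean_load_pct(const CpuSample& s) {
    if (s.core_pct.empty()) return 0.0;
    double sum = 0.0;
    for (int p : s.core_pct) sum += p;
    return sum / static_cast<double>(s.core_pct.size());
}

// Rough normalization (an assumption, not an exact model): the load the same
// work would represent if the CPU were running at `ref_mhz` instead.
double load_at_clock_pct(const CpuSample& s, int ref_mhz) {
    if (ref_mhz <= 0) return 0.0;
    return mean_load_pct(s) * s.freq_mhz / ref_mhz;
}
```

Feeding each captured line through `parse_cpu_field` and averaging `load_at_clock_pct(sample, 1734)` per scenario gives one number per test that can be set side by side.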

Hi SunYe,
Are you able to get the same result as https://devtalk.nvidia.com/default/topic/1014789/jetson-tx1/-the-cpu-usage-cannot-down-use-cuda-decode-/post/5175104/#5175104 ?

Which API are you using when you say 'CUDA decode'? The HW decoder on TX1/TX2 is a separate HW engine, not the GPU.

Hi SunYe,

Any update? Could you share the result and progress?

Thanks

I cannot share the project code because my company forbids it.
The key TX1 decoder code is attached to the first post.

Results:
1. H264 @ 10 Mbit/s (frame rate: 25)
2 channels H264 @ 10 Mbit/s, FFmpeg decode: drops frames, CPU 30% (including RTSP)
1 channel H264 @ 10 Mbit/s, FFmpeg decode: OK, CPU 25% (including RTSP)

2 channels H264 @ 10 Mbit/s, CUDA decode: OK, CPU 32% (including RTSP)
3 channels H264 @ 10 Mbit/s, CUDA decode: drops frames (including RTSP)

2. H265 @ 10 Mbit/s (frame rate: 25)
1 channel H265 @ 10 Mbit/s, FFmpeg decode: drops frames, CPU 25% (stream received over a private protocol)
2 channels H265 @ 10 Mbit/s, CUDA decode: OK, CPU 30% (stream received over a private protocol)
3 channels H265 @ 10 Mbit/s, CUDA decode: drops frames

CUDA decode can only handle 2 channels of H264/H265 @ 10 Mbit/s, no more.

We have verified four simultaneous 1080p25 transcodes on TX1. It should also work when using the MM APIs.
https://devtalk.nvidia.com/default/topic/979908/jetson-tx1/gstreamer-transcoding-performance-issue/post/5033461/#5033461

And please let me emphasize again that it is not 'CUDA decode'. The HW decoder on TX1/TX2 is a separate HW engine, not the GPU.

My scenario is receiving video data in real time and decoding it, not transcoding.

If the average decode time per frame cannot stay below 1000/25 = 40 ms, the H264/H265 data will overflow.

If I feed more than 2 channels of H264/H265 data (10 Mbit/s each) into the TX1 decoder,

the TX1 decode engine cannot decode quickly enough, so the H264/H265 data will overflow.

And please let me emphasize again that our scenario is real-time decode, not offline decode,

so the performance evaluation method is different, and the result is also different.
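The 40 ms budget above can be written out as a small calculation. This is only a sketch under a stated assumption: it supposes, purely for illustration, that all channels share one decode path that handles frames one after another. `frame_budget_ms`, `max_realtime_channels` and `avg_decode_ms` are hypothetical names, and `avg_decode_ms` would have to be measured; no such figure is given in this thread.

```cpp
// Time available to process one frame before the next one arrives, in ms.
double frame_budget_ms(double fps) {
    return 1000.0 / fps;
}

// Number of channels that fit into one frame period if frames are decoded
// strictly one after another (assumption for illustration only).
int max_realtime_channels(double avg_decode_ms, double fps) {
    if (avg_decode_ms <= 0.0 || fps <= 0.0) return 0;
    return static_cast<int>(frame_budget_ms(fps) / avg_decode_ms);
}
```

For example, `frame_budget_ms(25.0)` is 40 ms, and under the serial assumption a measured average of 15 ms per frame would leave room for 2 channels. If the real pipeline overlaps work across channels, this estimate is too pessimistic.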

You can use live555 to set up an RTSP server and use this command:
gst-launch-1.0 rtspsrc location="rtsp://192.168.110.232/3.mkv" ! rtph264depay ! h264parse ! omxh264dec ! nvoverlaysink -e
to test real-time decode performance.

Offline decode performance and real-time decode performance are different.

I don't use that command myself; I test with my own RTSP program.

And I don't use a live555 RTSP server; I use an IP camera as the RTSP source.

The result is: TX1 real-time decode performance is 2 channels of H264/H265 @ 10 Mbit/s, 25 fps, no more.

We have verified streaming playback of four 1080p25 @ 7.5 Mbit/s streams on two TX1 boards running r24.2.1. One TX1 acts as the server and the other as the client.

Test video file: bourne_ultimatum_trailer.zip (The Bourne Ultimatum, High Definition (1080p) Theatrical Trailer, from dvdloc8.com)

Server
Compile test-mp4.c from the GStreamer/gst-rtsp-server repository on GitHub (master branch)
Start the RTSP server

$ ./test-mp4 Bourne_Trailer.mp4

Client

$ export RTSP_PATH=rtsp://10.19.106.151:8554/test
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=100 window-y=100 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=800 window-y=100 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=100 window-y=500 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=800 window-y=500 window-width=640 window-height=360

The result is identical to offline decoding.

Please test with H264 @ 10 Mbit/s and H265 @ 10 Mbit/s.
I also suggest using a Linux virtual machine as the RTSP server
(use the task manager's graphical interface to check whether the bitstream really is 10 Mbit/s).

You should also record the video data to check whether frames are dropped.
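Both checks suggested above (is the stream really 10 Mbit/s, and are frames being dropped) can also be done in the receiving program. A minimal sketch, assuming the receiver can count the bytes it gets and that each decoded frame carries a presentation timestamp in milliseconds; `bitrate_mbps` and `count_missing_frames` are hypothetical helper names, not part of the sample or of any NVIDIA API:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Average bitrate in Mbit/s for `bytes` received over `seconds`.
double bitrate_mbps(std::uint64_t bytes, double seconds) {
    if (seconds <= 0.0) return 0.0;
    return static_cast<double>(bytes) * 8.0 / 1e6 / seconds;
}

// Estimates how many frames are missing from a sequence of presentation
// timestamps (milliseconds, in display order) at a nominal frame rate.
// A gap of about N frame intervals is counted as N - 1 missing frames;
// small jitter around one interval is tolerated by rounding.
std::size_t count_missing_frames(const std::vector<std::int64_t>& pts_ms,
                                 double fps) {
    if (pts_ms.size() < 2 || fps <= 0.0) return 0;
    const double interval_ms = 1000.0 / fps;
    std::size_t missing = 0;
    for (std::size_t i = 1; i < pts_ms.size(); ++i) {
        const double gap = static_cast<double>(pts_ms[i] - pts_ms[i - 1]);
        const long steps = std::lround(gap / interval_ms);
        if (steps > 1) missing += static_cast<std::size_t>(steps - 1);
    }
    return missing;
}
```

At 25 fps the nominal interval is 40 ms, so timestamps 0, 40, 80, 160, 200 would indicate one missing frame between 80 and 160. Timestamps taken from the network side and from the decoder output can be checked separately to see where the loss occurs.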