CPU usage does not go down (using CUDA decode)

SunYe · June 19, 2017, 11:22am

tegra_multimedia_api\samples\02_video_dec_cuda

I use this sample: I receive RTSP data and then decode the H.264 stream with CUDA, but CPU usage does not go down.

CPU usage is the same as when decoding H.264 on the CPU with ffmpeg 2.7.4.

tegrastats:
RAM 894/3994MB (lfb 437x4MB) cpu [33%,15%,6%,31%]@1734 GR3D 0%@76 EDP limit 0

TX1 log output:

TVMR: cbBeginSequence: 1166: BeginSequence 1920x1088, bVPR = 0, fFrameRate = 25.000000
TVMR: LowCorner Frequency = 0
TVMR: cbBeginSequence: 1545: DecodeBuffers = 2, pnvsi->eCodec = 4, codec = 0
TVMR: cbBeginSequence: 1606: Display Resolution : (1920x1080)
TVMR: cbBeginSequence: 1607: Display Aspect Ratio : (1920x1080)
TVMR: cbBeginSequence: 1649: ColorFormat : 5
TVMR: cbBeginSequence:1660 ColorSpace = NvColorSpace_YCbCr709
TVMR: cbBeginSequence: 1790: SurfaceLayout = 3
TVMR: cbBeginSequence: 1868: NumOfSurfaces = 3, InteraceStream = 0, InterlaceEnabled = 0, bSecure = 0, MVC = 0 Semiplanar = 1, bReinit = 1, BitDepthForSurface = 8 LumaBitDepth = 8, ChromaBitDepth = 8, ChromaFormat = 5
ev.type == V4L2_EVENT_RESOLUTION_CHANGE
Video Resolution: 1920x1080
libv4l2_nvvidconv (0):(765) (INFO) : Allocating (8) OUTPUT PLANE BUFFERS Layout=1
libv4l2_nvvidconv (0):(775) (INFO) : Allocating (8) CAPTURE PLANE BUFFERS Layout=0
Query and set capture successful
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000
TVMR: FrameRate = 25.000000

CudaDecode.h (2.94 KB)
CudaDecode.cpp (23.5 KB)

DaneLLL · June 20, 2017, 1:32am

Hi SunYe,
Please share the steps in detail so that we can check further. Thanks.

SunYe · June 20, 2017, 1:44am

Please see the attachments.

Usage:

CudaDecode cudaDecoder[MAX_VIDEO_CHAN];
cudaDecoder[i].SetChan(i);
cudaDecoder[i].Open();
cudaDecoder[iChan].DecodeFrame(packet->data, packet->size);  // called for each received packet
cudaDecoder[i].Close();

SunYe · June 20, 2017, 1:47am

Steps:
1. Receive RTSP H.264 video data from the network.
2. Decode it with CUDA.

DaneLLL · June 20, 2017, 1:52am

Hi SunYe,
The following APIs
CudaDecode cudaDecoder[MAX_VIDEO_CHAN];
cudaDecoder[i].SetChan(i);
cudaDecoder[i].Open();
cudaDecoder[iChan].DecodeFrame(packet->data, packet->size);
cudaDecoder[i].Close();

are not in
02_video_dec_cuda

Which sample do you refer to, please?

SunYe · June 20, 2017, 2:55am

The CudaDecode class is my modification of 02_video_dec_cuda. The API is my own, but the code is based on 02_video_dec_cuda.
See the code attached to the first post:

CudaDecode.h
CudaDecode.cpp

DaneLLL · June 28, 2017, 2:23am

Hi SunYe,
The following is what we observe via 02_video_dec_cuda

RAM 1101/3995MB (lfb 1x4MB) cpu [9%,4%,4%,0%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 0%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [5%,9%,2%,0%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 0%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [6%,2%,3%,9%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 192 MSENC 192 GR3D 4%@998 EDP limit 1734
RAM 1101/3995MB (lfb 1x4MB) cpu [7%,2%,0%,10%]@1734 EMC 7%@1600 AVP 49%@12 NVDEC 716 MSENC 716 GR3D 8%@998 EDP limit 1734

The video file is bourne_ultimatum_trailer.zip (The Bourne Ultimatum 1080p theatrical trailer, from dvdloc8.com).
We extract the H.264 stream and run

./video_dec_cuda /home/ubuntu/Bourne_Trailer.h264 H264

SunYe · June 28, 2017, 9:11am

My program, using CUDA decode (RTSP + CUDA):

RAM 887/3994MB (lfb 88x4MB) cpu [28%,24%,17%,21%]@518 EMC 12%@204 AVP 84%@14 GR3D 7%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [28%,18%,12%,8%]@102 EMC 6%@408 AVP 83%@17 GR3D 0%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [27%,20%,16%,12%]@307 EMC 1%@1600 AVP 83%@14 GR3D 5%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [33%,24%,18%,5%]@102 EMC 39%@68 AVP 83%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [34%,27%,25%,8%]@710 EMC 26%@102 AVP 82%@14 GR3D 0%@76 EDP limit 1734
RAM 887/3994MB (lfb 88x4MB) cpu [24%,10%,18%,6%]@102 EMC 39%@68 AVP 84%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [26%,26%,18%,10%]@307 EMC 1%@1600 AVP 83%@15 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [16%,21%,25%,9%]@102 EMC 13%@204 AVP 11%@115 GR3D 0%@76 EDP limit 1734
RAM 888/3994MB (lfb 88x4MB) cpu [28%,15%,24%,12%]@204 EMC 1%@1600 AVP 83%@14 GR3D 0%@76 EDP limit 1734

Without any decoding:
RAM 878/3994MB (lfb 92x4MB) cpu [24%,2%,10%,7%]@518 EMC 22%@40 AVP 83%@19 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [23%,0%,2%,19%]@102 EMC 22%@40 AVP 85%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [21%,5%,6%,6%]@307 EMC 13%@68 AVP 85%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 92x4MB) cpu [8%,2%,20%,2%]@102 EMC 2%@408 AVP 82%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [29%,0%,12%,1%]@204 EMC 13%@68 AVP 85%@15 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [26%,3%,12%,4%]@102 EMC 22%@40 AVP 82%@17 GR3D 0%@76 EDP limit 1734
RAM 878/3994MB (lfb 93x4MB) cpu [25%,0%,5%,13%]@204 EMC 13%@68 AVP 84%@17 GR3D 0%@76 EDP limit 1734

Using ffmpeg decode (RTSP + ffmpeg):

RAM 884/3994MB (lfb 91x4MB) cpu [15%,30%,3%,37%]@1734 EMC 1%@1600 AVP 83%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [7%,3%,0%,62%]@307 EMC 42%@68 AVP 83%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [11%,1%,1%,70%]@204 EMC 7%@408 AVP 69%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [21%,8%,24%,26%]@204 EMC 42%@68 AVP 78%@12 GR3D 0%@76 EDP limit 1734
RAM 884/3994MB (lfb 91x4MB) cpu [32%,2%,1%,47%]@1224 EMC 1%@1600 AVP 85%@13 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [25%,32%,2%,20%]@1734 EMC 1%@1600 AVP 82%@13 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [20%,7%,54%,1%]@1734 EMC 1%@1600 AVP 84%@12 GR3D 0%@76 EDP limit 1734
RAM 885/3994MB (lfb 91x4MB) cpu [8%,42%,16%,4%]@1734 EMC 1%@1600 AVP 82%@13 GR3D 0%@76 EDP limit 1734
RAM 884/3994MB (lfb 91x4MB) cpu [17%,3%,50%,5%]@307 EMC 42%@68 AVP 85%@14 GR3D 0%@76 EDP limit 1734
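
One caveat when comparing the three sets of tegrastats lines above: the per-core percentages are reported at different CPU clocks. The CUDA samples show clocks between 102 and 710, while several ffmpeg samples are at 1734, so equal percentages may not mean equal work. The sketch below parses the `cpu [...]@freq` field and scales the mean load by the clock. It assumes the number after `@` is the current CPU clock in MHz, and the "busy MHz" figure is only a rough heuristic, not an official metric. The names `CpuSample`, `parseCpuField`, `meanLoad` and `busyMHz` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One tegrastats sample: per-core load in percent plus the value that
// follows the closing bracket ("cpu [28%,24%,17%,21%]@518").
struct CpuSample {
    std::vector<int> loads;  // per-core load, percent
    int freqMHz = 0;         // value after '@' (assumed MHz), 0 if missing
};

// Parse the "cpu [...]@freq" field of one tegrastats line in the format
// shown in this thread. Returns an empty sample if the field is absent.
CpuSample parseCpuField(const std::string& line) {
    CpuSample s;
    const std::string tag = "cpu [";
    std::size_t pos = line.find(tag);
    if (pos == std::string::npos) return s;
    pos += tag.size();

    int value = 0;
    bool haveDigits = false;
    for (; pos < line.size() && line[pos] != ']'; ++pos) {
        const char c = line[pos];
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            haveDigits = true;
        } else {
            if (c == '%' && haveDigits) s.loads.push_back(value);
            value = 0;
            haveDigits = false;
        }
    }
    // Optional "@<freq>" directly after the closing bracket.
    if (pos + 1 < line.size() && line[pos] == ']' && line[pos + 1] == '@') {
        for (pos += 2; pos < line.size() && line[pos] >= '0' && line[pos] <= '9'; ++pos)
            s.freqMHz = s.freqMHz * 10 + (line[pos] - '0');
    }
    return s;
}

// Mean load across cores, in percent (0 for an empty sample).
double meanLoad(const CpuSample& s) {
    if (s.loads.empty()) return 0.0;
    long sum = 0;
    for (int v : s.loads) sum += v;
    return static_cast<double>(sum) / static_cast<double>(s.loads.size());
}

// Rough "busy MHz per core": mean load scaled by the clock of that sample.
double busyMHz(const CpuSample& s) {
    assert(s.freqMHz >= 0);
    return meanLoad(s) / 100.0 * s.freqMHz;
}
```

Running every line of each scenario through `parseCpuField` and averaging `busyMHz` gives one figure per scenario that is less sensitive to frequency scaling than the raw percentages.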

SunYe · June 28, 2017, 9:17am

On an Intel i7 6700 with a GTX 960 video card (Windows 7),

CPU usage with CUDA decode is clearly lower than with ffmpeg decode,
but on TX1 the situation is different. Why?

DaneLLL · July 5, 2017, 1:33am

Hi SunYe,
Are you able to get the same result as https://devtalk.nvidia.com/default/topic/1014789/jetson-tx1/-the-cpu-usage-cannot-down-use-cuda-decode-/post/5175104/#5175104 ?

Which API do you mean when you say 'CUDA decode'? The HW decoder on TX1/TX2 is a separate hardware engine, not the GPU.
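
Since the decoder is its own engine, one quick check is whether it shows up in tegrastats at all. DaneLLL's lines above contain `NVDEC` and `MSENC` fields, while the lines SunYe posted do not. What that absence means is not established in this thread (it may simply be a different tegrastats version). The helper below is a minimal sketch that pulls the number following a named field out of a line. `engineField` is a name made up for this example, and it makes no claim about what unit the number is in.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the integer that follows "<name> " in a tegrastats line,
// or -1 if the field is not present. Only matches at the start of a
// space-separated token, so "NVDEC" does not match inside another word.
int engineField(const std::string& line, const std::string& name) {
    assert(!name.empty());
    const std::string token = name + " ";
    std::size_t pos = 0;
    while ((pos = line.find(token, pos)) != std::string::npos) {
        const bool atTokenStart = (pos == 0) || (line[pos - 1] == ' ');
        std::size_t i = pos + token.size();
        if (atTokenStart && i < line.size() && line[i] >= '0' && line[i] <= '9') {
            int value = 0;
            for (; i < line.size() && line[i] >= '0' && line[i] <= '9'; ++i)
                value = value * 10 + (line[i] - '0');
            return value;
        }
        pos += token.size();  // keep searching after this occurrence
    }
    return -1;
}
```

Calling `engineField(line, "NVDEC")` on each captured line and counting how often it returns -1 shows whether the field ever appears while the program is decoding.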

kayccc · July 13, 2017, 4:46am

Hi SunYe,

Any update? Could you share the result and progress?

Thanks

SunYe · July 17, 2017, 7:53am

I cannot share the project code because my company forbids it.
I have already shared the key TX1 decoder code in the first post.

Results:
1. H.264 @ 10 Mbit/s (frame rate: 25)
2 channels, ffmpeg decode: drops frames, CPU 30% (including RTSP)
1 channel, ffmpeg decode: OK, CPU 25% (including RTSP)

2 channels, CUDA decode: OK, CPU 32% (including RTSP)
3 channels, CUDA decode: drops frames (including RTSP)

2. H.265 @ 10 Mbit/s (frame rate: 25)
1 channel, ffmpeg decode: drops frames, CPU 25% (stream received over a private protocol)
2 channels, CUDA decode: OK, CPU 30% (stream received over a private protocol)
3 channels, CUDA decode: drops frames

CUDA decode can only handle 2 channels of H.264/H.265 at 10 Mbit/s, no more.

DaneLLL · July 17, 2017, 8:10am

We have verified four simultaneous 1080p25 transcoding sessions on TX1. This should also work when using the MM APIs.
https://devtalk.nvidia.com/default/topic/979908/jetson-tx1/gstreamer-transcoding-performance-issue/post/5033461/#5033461

And please let me emphasize again that this is not 'CUDA decode'. The HW decoder on TX1/TX2 is a separate hardware engine, not the GPU.

SunYe · July 20, 2017, 2:27am

My use case is receiving video data in real time and decoding it, not transcoding.

If the average decode time per frame cannot stay below 1000/25 = 40 ms, the H.264/H.265 data will overflow.

If I feed more than 2 channels of H.264/H.265 data (10 Mbit per second each) into the TX1 decoder,

the TX1 decode engine cannot decode quickly enough, so the H.264/H.265 data overflows.

And please let me emphasize again: our use case is real-time decode, not offline decode,

so the way performance is evaluated is different, and the result is also different.
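
The 40 ms budget above can be written out as a small calculation. This is a sketch under one assumption that the thread does not confirm: that all channels are decoded one frame at a time by a single decoder, so the per-frame budget shrinks as channels are added. The function names and the 15 ms decode time in the usage note are hypothetical.

```cpp
#include <cassert>

// Time available per frame, in ms, when `channels` streams at `fps`
// are decoded one frame at a time by a single decoder.
double frameBudgetMs(double fps, int channels) {
    assert(fps > 0.0 && channels > 0);
    return 1000.0 / (fps * channels);
}

// Frames per second by which the input queue grows when the decoder
// needs avgDecodeMs per frame. Returns 0 if the decoder keeps up.
double backlogGrowthPerSecond(double fps, int channels, double avgDecodeMs) {
    assert(fps > 0.0 && channels > 0 && avgDecodeMs > 0.0);
    const double arrival = fps * channels;        // frames/s arriving
    const double service = 1000.0 / avgDecodeMs;  // frames/s decoded
    return arrival > service ? arrival - service : 0.0;
}
```

For example, `frameBudgetMs(25, 1)` is 40 ms and `frameBudgetMs(25, 2)` is 20 ms. With a hypothetical 15 ms average decode time, two channels keep up, but three channels at 25 fps arrive faster than they can be decoded and the queue grows until data is dropped.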

SunYe · July 20, 2017, 2:31am

You can use live555 to set up an RTSP server and run this command:
gst-launch-1.0 rtspsrc location="rtsp://192.168.110.232/3.mkv" ! rtph264depay ! h264parse ! omxh264dec ! nvoverlaysink -e
to test real-time decode performance.

Offline decode performance and real-time decode performance are different.

SunYe · July 20, 2017, 2:37am

I did not use that command; I tested with my own RTSP program.

I also did not use a live555 RTSP server; I used an IP camera as the RTSP source.

The result: real-time decode on TX1 tops out at 2 channels of H.264/H.265 at 10 Mbit/s and 25 fps.

DaneLLL · July 20, 2017, 7:10am

We have verified streaming playback of four 1080p25 streams at 7.5 Mbit/s using two TX1 boards running r24.2.1, one as the server and the other as the client.

Test video file: bourne_ultimatum_trailer.zip (The Bourne Ultimatum 1080p theatrical trailer, from dvdloc8.com)

Server
Compile test-mp4.c from the GStreamer/gst-rtsp-server repository on GitHub (master branch).
Start the RTSP server:

$ ./test-mp4 Bourne_Trailer.mp4

Client

$ export RTSP_PATH=rtsp://10.19.106.151:8554/test
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=100 window-y=100 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=800 window-y=100 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=100 window-y=500 window-width=640 window-height=360 &
$ gst-launch-1.0 rtspsrc location="$RTSP_PATH" ! rtph264depay ! h264parse ! omxh264dec ! nveglglessink window-x=800 window-y=500 window-width=640 window-height=360

The result is identical to offline decoding.

SunYe · July 20, 2017, 9:49am

Please test H.264 at 10 Mbit/s and H.265 at 10 Mbit/s.
I also suggest running the RTSP server on a Linux virtual machine
(use the task manager's network graph to check whether the bitstream really is 10 Mbit/s).
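
The bitrate can also be checked inside the receiving program rather than in a task manager graph. A minimal sketch, assuming you sum the payload sizes you hand to the decoder (for example `packet->size` from the usage snippet earlier in the thread) over a measured interval, and taking 1 Mbit as 10^6 bits. `bitrateMbps` is a name made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Average bitrate in Mbit/s (1 Mbit = 1e6 bits) for `bytes` received
// over `seconds` of wall-clock time.
double bitrateMbps(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) * 8.0 / seconds / 1e6;
}
```

A 10 Mbit/s stream should add up to about 1,250,000 bytes per second per channel.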

SunYe · July 20, 2017, 9:53am

You should also record the video data to see whether frames are being dropped.
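
One way to check a recording for dropped frames is to look for gaps in the frame timestamps. The sketch below assumes a constant frame rate and a list of presentation timestamps in milliseconds in increasing order. How you obtain those timestamps depends on your pipeline and is not covered here. `estimateDroppedFrames` is a name made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimate how many frames are missing from a sequence of presentation
// timestamps (ms, increasing) for a constant-frame-rate stream.
// A gap of about k frame intervals is counted as k - 1 missing frames.
long estimateDroppedFrames(const std::vector<std::int64_t>& ptsMs, double fps) {
    assert(fps > 0.0);
    const double interval = 1000.0 / fps;  // 40 ms at 25 fps
    long dropped = 0;
    for (std::size_t i = 1; i < ptsMs.size(); ++i) {
        const double delta = static_cast<double>(ptsMs[i] - ptsMs[i - 1]);
        const long steps = std::lround(delta / interval);
        if (steps > 1) dropped += steps - 1;
    }
    return dropped;
}
```

Rounding to the nearest whole interval tolerates timestamp jitter of less than half a frame period. Larger jitter would need a smarter estimate.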
