I want to make a 4K/30fps streaming server application on the Jetson TX1.
L4T: R24.1 (32-bit)
Input: USB 3.0, YUV (I420), 3,840x2,160 @ 30 fps (via HDMI-to-USB3 converter)
Output: Ethernet, RTP/UDP/IP
Encode Buffer: allocated by cudaHostAlloc()
I have written a capture module as a C++ program using the V4L2 interface in userptr mode.
(I referred to the V4L2 sample code.)
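For reference, the userptr setup looks roughly like the sketch below (simplified: device path, buffer count, and error handling are illustrative, and buf_ptrs stands for the buffers allocated with cudaHostAlloc()):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

void *buf_ptrs[4];                                 // assumed filled earlier via cudaHostAlloc()

int fd = open("/dev/video0", O_RDWR);              // device node is an example

struct v4l2_format fmt;
memset(&fmt, 0, sizeof(fmt));
fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
fmt.fmt.pix.width       = 3840;
fmt.fmt.pix.height      = 2160;
fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUV420;     // I420
fmt.fmt.pix.field       = V4L2_FIELD_NONE;
ioctl(fd, VIDIOC_S_FMT, &fmt);

struct v4l2_requestbuffers req;
memset(&req, 0, sizeof(req));
req.count  = 4;                                    // buffer count is an example
req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
req.memory = V4L2_MEMORY_USERPTR;
ioctl(fd, VIDIOC_REQBUFS, &req);

// Queue the application-allocated buffers.
for (unsigned i = 0; i < req.count; ++i) {
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type      = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory    = V4L2_MEMORY_USERPTR;
    buf.index     = i;
    buf.m.userptr = (unsigned long)buf_ptrs[i];    // memory from cudaHostAlloc()
    buf.length    = fmt.fmt.pix.sizeimage;
    ioctl(fd, VIDIOC_QBUF, &buf);
}

enum v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
ioctl(fd, VIDIOC_STREAMON, &type);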
However, it takes over 40 ms from enqueue to dequeue in the OpenMAX encoder,
so it seems that I cannot achieve 30 fps encoding.
Following the R24.1 release notes, I have tried to maximize CPU/GPU performance.
Unfortunately, the result is almost the same as before running the scripts.
What is the maximum frequency of a CPU core?
→ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
… 1734000
(So is the maximum 1.734 GHz?)
(The GPU clock can be set to about 1 GHz.)
Can I get 30 fps performance with the TX1 and the OpenMAX IL API (without GStreamer elements)?
Could anyone give me advice on how to achieve this performance?
I have confirmed that the bandwidth of USB 3.0 is sufficient,
and that the interval between captures of a 4K image via USB 3.0 is about 33 ms (29-35 ms) on a Jetson TX1 development board.
The size of a 4K image:
3,840 x 2,160 = 8,294,400 pixels
I420: 12 bits x 8,294,400 pixels = 99,532,800 bits per frame
Bit rate: 99,532,800 bits x 30 fps = 2,985,984,000 bits/s ≈ 2.986 Gbps
The bandwidth of USB 3.0 is 5 Gbps, which is enough to transfer 4K/30fps YUV I420.
I tried the following:
(a) Capture a 4K image via USB 3.0 -- (1)
(b)-1 Enqueue one image to the encoder -- (2)
(b)-2 Alternately discard one image
(c) Dequeue H.264 output from the encoder -- (3)
The interval of (1) is about 33 ms (29-35 ms).
The difference between (2) and (3) is over 40 ms.
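For reference, these intervals are measured with a monotonic clock around each stage, roughly as in this sketch (encoder_enqueue() and wait_for_encoded_output() are placeholders for the actual OpenMAX calls):

#include <stdio.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* inside the encode loop: */
double t_enq = now_ms();
encoder_enqueue(frame);            // placeholder: hand one raw image to the encoder  -- (2)
wait_for_encoded_output();         // placeholder: wait until the H.264 output returns -- (3)
double t_deq = now_ms();
printf("encode latency: %.1f ms\n", t_deq - t_enq);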
I suspect that the OpenMAX IL API is not well optimized for maximum performance.
I am still waiting for advice and suggestions.
Here is the correct computation for
4K in YUV 4:2:0 at 12 bits:
12 bits x 3840 x 2160 x 1.5 x 30 fps = 4.48 Gbit/s
(the 1.5 is because of the I420 chroma decimation)
Note that the actual USB 3.0 bandwidth is lower than 5 Gbit/s: the USB bus is shared with other clients, and implementation efficiency is less than the maximum. You may end up with 50% of the 5 Gbit/s, which is 2.5 Gbit/s.
From this you can see that even reaching 15 fps is tough.
However, if the camera has built-in compression, it may be a good idea to use that to transfer all the images and do the transcoding on the GPU afterwards.
Sorry for my poor explanation.
"12 bits" already means Y: 8 bits + U: 8 bits x 1/4 + V: 8 bits x 1/4, i.e. the 1.5 factor for I420 is already included, so the rate is about 2.986 Gbps, not 4.48 Gbit/s.
I have already succeeded in capturing 4K at 30 fps.
My C++ code uses the V4L2 interface in userptr mode and waits for the completion of each frame capture with a select() call.
The interval between these completion logs was 29-35 ms.
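The wait loop is roughly the following sketch (fd is the V4L2 device descriptor from the setup above; the timeout value and error handling are illustrative, and log_interval() is a placeholder for our timestamp logging):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <linux/videodev2.h>

for (;;) {
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    struct timeval tv = { 2, 0 };              // 2 s timeout (example)
    if (select(fd + 1, &fds, NULL, NULL, &tv) <= 0)
        continue;                              // timeout/EINTR handling omitted

    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_USERPTR;
    ioctl(fd, VIDIOC_DQBUF, &buf);             // capture of one frame finished
    log_interval();                            // placeholder: log time since previous frame
    // ... hand the frame to the encoder, then re-queue the buffer:
    ioctl(fd, VIDIOC_QBUF, &buf);
}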
Thank you for your suggestion about the ZHAW HDMI/CSI board, too.
I know of it and have asked about its price.
But I am using an HDMI/USB converter:
https://inogeni.com/4k-usb3-0/
This converter supports the I420 format and the UVC interface.
It can be configured via V4L2 ioctl().
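Besides VIDIOC_S_FMT for the resolution and pixel format, the frame rate is requested with VIDIOC_S_PARM, roughly like this sketch (it assumes the UVC driver honors timeperframe):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

struct v4l2_streamparm parm;
memset(&parm, 0, sizeof(parm));
parm.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
parm.parm.capture.timeperframe.numerator   = 1;
parm.parm.capture.timeperframe.denominator = 30;   // request 30 fps
ioctl(fd, VIDIOC_S_PARM, &parm);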
OK, I see.
Did you have a look at the new encoding API in L4T R24.2? It was released two weeks ago. NVIDIA claims to have improved the encoding API, and there are several encoding examples.
We want to build a prototype to confirm that we can obtain 4K/30fps streaming performance with the Jetson TX1 development kit.
Unfortunately, the camera on the devkit has a 5 MP resolution, which is less than 4K.
Also, other cameras using the MIPI/CSI interface cannot use the ISP on the TX1 to convert RAW (Bayer RGGB) to YUV.
We are therefore using the USB 3.0 interface to input 4K I420 video.
We have written C++ code to capture with V4L2, encode H.264 with OpenMAX, and send RTP packets.
We have already measured the time needed to copy between buffers, to encode a picture, and to assemble and send packets.
When cudaHostAlloc() is used for the V4L2 userptr capture buffer, memcpy() takes 50-60 ms per 4K picture, so we think CUDA has to be used instead of memcpy().
When malloc() (memalign) is used, memcpy() takes 10 ms to copy from the V4L2 buffer to the OpenMAX buffer (allocated with cudaHostAlloc()).
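If we replace memcpy() with CUDA for that copy, it would look roughly like the sketch below (src, dst, and stream are placeholders; both buffers are assumed pinned via cudaHostAlloc(), and whether this is actually faster than memcpy() would still have to be measured):

#include <cuda_runtime.h>

size_t frame_size = 3840UL * 2160UL * 3 / 2;        // I420: 12 bits per pixel

// Synchronous host-to-host copy between two pinned buffers:
cudaMemcpy(dst, src, frame_size, cudaMemcpyHostToHost);

// Or asynchronously, to overlap the copy with capture/encode:
cudaMemcpyAsync(dst, src, frame_size, cudaMemcpyHostToHost, stream);
cudaStreamSynchronize(stream);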
It takes over 40 ms from enqueue (OpenMAX FillThisBuffer) to dequeue (the OpenMAX FillBufferDone callback with the H.264 output).
In each case, H.264 encoding takes over 40 ms from enqueue to dequeue, excluding memory copies.
I would like to know the programming conditions required to get 4K/30fps encoding.
Other sensors can use the ISP on the TX1 if you don't care about image tuning, but the sensor bring-up may take some effort.
You can migrate to R24.2 and use the Multimedia API, which provides the following functions for NvBuffer:
int NvBufferCreate (int *dmabuf_fd, int width, int height, NvBufferLayout layout, NvBufferColorFormat colorFormat);
int NvBufferGetParams (int dmabuf_fd, NvBufferParams *params);
int NvBufferDestroy (int dmabuf_fd);
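A minimal usage sketch based on those prototypes might look like the following (the enum values NvBufferLayout_Pitch and NvBufferColorFormat_YUV420 are assumptions taken from nvbuf_utils.h; check the header shipped with your release):

#include "nvbuf_utils.h"

int dmabuf_fd = -1;

// Create a 3840x2160 pitch-linear YUV420 hardware buffer.
if (NvBufferCreate(&dmabuf_fd, 3840, 2160,
                   NvBufferLayout_Pitch,
                   NvBufferColorFormat_YUV420) != 0) {
    /* handle failure */
}

// Query plane count, pitches, offsets, etc.
NvBufferParams params;
NvBufferGetParams(dmabuf_fd, &params);

// ... use dmabuf_fd as the encoder input buffer ...

NvBufferDestroy(dmabuf_fd);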
I have found "NvBufferDestroy()" and "createNvBuffer()" in the sample source code (R24.2).
But unfortunately I cannot find "NvBufferCreate()" or "NvBufferGetParams()" in:
(1) NVIDIA Tegra Multimedia API Framework Documentation (L4T R24.2)
(2) include/NvBuffer.h
(3) samples/common/classes/NvBuffer.cpp
(4) include/EGLStream/NV/ImageNativeBuffer.h
(5) gstomx1_src.tbz2
Could you please tell me which documents or source code I should refer to?
Thank you very much!
I was able to read the API specification in the document you pointed to:
"Tegra Linux Driver Package Developer Guide" → "Multimedia API Reference Documentation"
(1) Method NvBufferCreate() and NvBufferGetParams()
int NvBufferCreate (
int * dmabuf_fd,
int width,
int height,
NvBufferLayout layout,
NvBufferColorFormat colorFormat
)
Parameters
[out] dmabuf_fd Returns dmabuf_fd of hardware buffer.
[in] width Specifies the hardware buffer width, in bytes.
[in] height Specifies the hardware buffer height, in bytes.
[in] layout Specifies the layout of buffer.
[in] colorFormat Specifies the colorFormat of buffer.
Returns
0 for success, -1 for failure.
int NvBufferGetParams ( int dmabuf_fd,
NvBufferParams * params
)
Parameters
[in] dmabuf_fd DMABUF FD of buffer.
[out] params A pointer to the structure to fill with parameters.
Returns
0 for success, -1 for failure.
(2) Data Structure _NvBufferParams (defined in nvbuf_utils.h)
I am having issues capturing 4K at 30 fps using the same INOGENI HDMI to USB 3.0 contraption. I am using a TK1 for this instead of a TX1, but the results are the same for me on both. I am capturing 3840x2160@30Hz frames from the HDMI output of a TX1 and encoding on the TK1 with ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10 directly from /dev/videoX, but I am only getting 8-9 fps instead of close to 30 fps. A 1080p capture with the same setup runs at >25 fps.
Unfortunately, I have no good information.
I am using the standard L4T R24.2 on our Jetson TX1s.
I have not modified or reconfigured the kernel or the USB3/V4L2 drivers.
My configuration of the INOGENI via ioctl() (V4L2 driver) is:
3840x2160@30fps
YUV 4:2:0 (I420)
Are you using a TK1?
JetPack 2.3 includes:
TX1: L4T R24.2 (64-bit)
TK1: L4T R21.5
I don't know the differences between TK1 R21.5 and TX1 R24.2.
I don't understand this part:
ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10
I think that in L4T R24.2 there is only GStreamer 1.0 (no GStreamer 0.10).
Could you please test the performance without ffmpeg elements?
Video Source --(HDMI)--> INOGENI --(USB3)--> Jetson TK1 -> /dev/videoN -> v4l2src -> some overlay sink
Does it capture at 30 fps?
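For example, something along these lines (untested here; the device node and sink are illustrative):

gst-launch-1.0 v4l2src device=/dev/video0 ! \
  'video/x-raw, format=I420, width=3840, height=2160, framerate=30/1' ! \
  nvoverlaysink -e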
Also, would you try the NVIDIA System Profiler?
I see our capture thread (written by ourselves) running every 33 ms (30 fps, 29-39 ms).
I have not yet tuned the OpenMAX H.264 encoder down to 33 ms (it is still at about 28-29 fps).
Before using NvBuffer for the input buffer of the H.264 encoder, I saw 15-20 fps at 3840x2160.
Thank you for answering. I am also using the same INOGENI configuration as you: 3840x2160@30fps. I have updated the TK1 to the latest software/driver releases.
On the ffmpeg/avconv/gst-launch-1.0/gst-launch-0.10 line: I am sorry about the confusion; I meant that I get similar results using ffmpeg, avconv, gst-launch-1.0, or gst-launch-0.10. The main difference is basically CPU load, with gst-launch-0.10 and nv_omx_h264enc being the least CPU demanding.
The setup I have is:
Video Source --(HDMI)--> INOGENI --(USB3)--> Jetson TK1 -> /dev/videoN -> v4l2src -> some overlay sink
No matter what sink I use (/dev/null, file, RTMP server), I get 8-9 fps with any of the command-line utilities. However, I wrote a C program that grabs the frames (3840x2160) using the V4L2 API, and I see an improvement to >15 fps (just grabbing), but still well under the theoretical 30 fps. This is using mmap; I will try pinned memory next to see whether there is an improvement.
Next I will investigate the encoding API options; I need to publish the captured frames to a remote RTMP server. I see you are using the OpenMAX encoder API. Is there any reason for that choice?