I would also be very interested in performance improvements that bring L4T 32.1 with CUDA 10.0 on par with L4T 28.2.1 with CUDA 9.0. I just updated to the new JetPack 4.2 and noticed the following performance degradations:
- Running a custom frozen TensorFlow model (without TF-TRT or anything) consisting of a combined CNN and RNN went from 100 iterations in 7.32 s to 10.64 s, a 45% increase in runtime. Note that the L4T 28.2.1 setup uses TensorFlow 1.8 and the L4T 32.1 setup uses TensorFlow 1.13.1.
- Running trt-yolo-app (so basically a TensorRT engine) wrapped in pybind11 went from 100 batches in 16.55 s to 24.28 s, a 47% increase in runtime. However, the default trt-yolo-app DS2 application stayed constant in speed, at 11.58 ms vs. 11.61 ms per image; I noticed that there, only the inference itself is timed.
This looks to be consistent with the findings of bk1472: CUDA performance seems to be fine, but the other parts of the Jetson TX2 system got slower with the new release. So much so that the slowdown makes JetPack 4.2 unusable for me.
bk1472, could you tell me which tools you used to measure CPU, memory, and disk performance? I would like to reproduce your test results.
I also did some more benchmarks of my own; see this image. The results seem all over the place to me. Maybe it’s due to different versions of sysbench? (I just installed it using apt-get.) At least I reproduced your glmark2 results.
EDIT: I updated my benchmark to use the same version of sysbench on both setups. This eliminates the discrepancies between the scores, except for I/O, which does seem to be twice as fast on JetPack 4.2. I also ran two ‘threads’ benchmarks and noticed that with many threads, JetPack 4.2 is about 4x slower!
Another salient point I noticed that might be of help:
TensorFlow 1.13.1 reports:
I used sysbench (you can install it with apt-get) and glmark2 (the 2017 release).
I tested CUDA performance with some of the CUDA release sample programs
(there is no difference in CUDA performance between JetPack 4.2 and JetPack 3.x),
but the glmark2 results show a big difference.
I know glmark2 is not an official NVIDIA tool, but I could not find another one.
The most recent JetPack/SDK Manager is listed here:
https://developer.nvidia.com/embedded/jetpack
(you might have to log in there and then hit the URL a second time since redirects don’t work correctly)
I don’t know when, but there will probably be a newer release soon.
To verify the TensorRT speed issues, I have done some more testing on that front. I created this minimal example and ran it on JetPack 3.2 and 4.2. For reference: the engine is built in kHALF (FP16) mode with batch size 16, and the model is a variation on the tiny-yolo-v3 network with a 512x288 input.
JetPack 3.2 took 10.3198 ms per image
JetPack 4.2 took 11.6762 ms per image
So in this case it’s a 12% speed decrease. When I alter the example so that only the
m_inferNet->doInference(input.data())
part is measured, the results change to:
JetPack 3.2 took 9.55303 ms per image
JetPack 4.2 took 9.35428 ms per image
This confirms the earlier findings that CUDA seems to be just as fast on JetPack 4.2 (or even slightly faster), while other parts suffer a significant slowdown: in this case, NVIDIA’s own decodeDetections and nonMaximumSupression functions.
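To make the two timing scopes concrete, here is a minimal self-contained sketch. doInference() and postProcess() are stand-ins for m_inferNet->doInference(), decodeDetections and nonMaximumSupression from the trt-yolo-app code; the batch counts are illustrative, not the exact values from my runs.

```cpp
#include <chrono>
#include <cstdio>

static void doInference() { /* GPU work: TensorRT execute + memcpies */ }
static void postProcess() { /* CPU work: decode detections, NMS      */ }

int main()
{
    using Clock = std::chrono::steady_clock;
    const int numBatches = 100, batchSize = 16;

    // Scope 1: time the whole loop body, as in the first measurement
    // (10.3198 vs 11.6762 ms per image).
    auto t0 = Clock::now();
    for (int i = 0; i < numBatches; ++i) { doInference(); postProcess(); }
    auto t1 = Clock::now();

    // Scope 2: time only the inference call, as in the second measurement
    // (9.55303 vs 9.35428 ms per image).
    auto t2 = Clock::now();
    for (int i = 0; i < numBatches; ++i) { doInference(); }
    auto t3 = Clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("pipeline : %.4f ms/image\n", ms(t0, t1) / (numBatches * batchSize));
    std::printf("inference: %.4f ms/image\n", ms(t2, t3) / (numBatches * batchSize));
    return 0;
}
```

The gap between the two scopes isolates the CPU-side post-processing, which is exactly where the regression shows up.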
Okay, after a week of constant frustration, I managed to resolve my issues.
Regarding the TensorFlow speed issues: I managed to alleviate them somewhat by compiling TensorFlow 1.8.0 for Python 3.6 myself. This reduces the slowdown from 45% extra runtime to 11% extra runtime, so it is still slower, but less so.
The TensorRT speed issues were not caused by the arithmetic in the decodeDetections and nonMaximumSupression functions, but by the memory access. As explained in this thread, cudaMallocHost changed its behavior in CUDA 10.0, so pinned memory can no longer use CPU caching. I resolved this by using cudaMallocManaged instead of cudaMallocHost in this line, or the equivalent line in the DeepStream 2.0 repository (make sure to change cudaFreeHost to cudaFree too). I have double-checked the output, because there is a warning:
Note: The unified memory model requires the driver and system software to manage coherence on the current Tegra SOC. Software managed coherence is by nature non-deterministic and not recommended in a safe context. Zero-copy memory (pinned memory) is preferable in these applications.
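For reference, a minimal sketch of the allocation swap, assuming a host-visible buffer like the network output that the post-processing reads; the buffer name and size here are illustrative, not the exact lines from the repository:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Illustrative buffer: in the real code this holds the network output
    // that decodeDetections/nonMaximumSupression read on the CPU.
    const size_t bytes = 16 * (1 << 16) * sizeof(float);
    float* buf = nullptr;

    // Before (fast on CUDA 9.0 / JetPack 3.x): pinned host memory. On
    // CUDA 10.0 on Tegra the same call returns uncached memory, making
    // CPU-heavy post-processing over it much slower.
    // cudaMallocHost(reinterpret_cast<void**>(&buf), bytes);
    // ... cudaFreeHost(buf);

    // After: managed memory, which the CPU can cache again. Verify the
    // outputs, since software-managed coherence on Tegra is
    // non-deterministic (see the warning quoted above).
    if (cudaMallocManaged(reinterpret_cast<void**>(&buf), bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    // ... write from the GPU (kernel or cudaMemcpy), then read on the CPU ...
    cudaFree(buf);
    return 0;
}
```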
I checked OpenGL performance with the glmark2 benchmark tool on JetPack 4.2.1 (the new version), as bk1472 did.
But OpenGL performance is still lower than on JetPack 3.2 (Ubuntu 16.04).
I think there is no improvement in OpenGL performance in the new JetPack version (4.2 → 4.2.1).
How does NVIDIA verify OpenGL performance for a new OS / new JetPack release?
Could you share your method/tool for verifying OpenGL performance, and the results?
Is there any performance improvement for the TX2?
I tested a video player (smplayer) with a simple MP4 file on the TX2, and the results differed between Ubuntu 16.04 and Ubuntu 18.04 with JetPack 4.2.1.
Ubuntu 16.04 seemed OK, but 18.04 showed a serious problem: more than 40% CPU usage for the X server process.