JetPack 4.2 based TX2 (Ubuntu 18.04) OpenGL performance issue

Hi

When JetPack 4.2 was released last March,
I ran benchmark tests on my TX2 in both the JetPack 3.2 environment and the JetPack 4.2 environment.

Below I will share the test results with pictures.

The CPU / memory / file I/O results were not of much interest,
but there was a big difference when we ran the GUI test with OpenGL.

My benchmark tool is glmark2.

Does the OpenGL library in JetPack 4.2 need optimization?

Hi bk1472,

We’re investigating to see what might be the cause.
Thanks for providing this test result.

Thank you for your reply.

Do you have a schedule for SW performance improvements for the TX2 (JetPack 4.2)?

Hi bk1472,

The next release is scheduled for early July; all the implementation and improvement work is under way.

Thanks

I would also be very interested in performance improvements to bring L4T 32.1 with Cuda 10.0 performance on par with L4T 28.2.1 with Cuda 9.0. I just updated to the new Jetpack 4.2 and noticed the following performance degradations:

  • running a custom frozen Tensorflow model (without TF-TRT or anything) consisting of a combined CNN and RNN went from 100 iterations in 7.32 seconds to 10.64 seconds, an increase of 45% runtime. Note here that the L4T 28.2.1-setup uses Tensorflow 1.8 and the L4T 32.1 setup uses Tensorflow 1.13.1

  • running trt-yolo-app (so basically a TensorRT engine) wrapped in pybind11 went from 100 batches in 16.55s to 24.28s, an increase of 47% runtime. However, the default trt-yolo-app DS2 application stayed constant in speed with 11.58ms per image to 11.61ms per image. I noticed there only the inference itself is timed.

This looks to be consistent with the findings of bk1472 - CUDA performance seems to be okay but the other parts of the Jetson TX2 system got slower with the new release. So much so that the decrease in speed makes Jetpack 4.2 unusable for me.

bk1472, could you tell me what tools you used to measure the CPU, Memory and Disk performance? I would like to reproduce your test results.

I also did some more benchmarks of my own, see this image. The results seem all over the place to me. Maybe it’s due to different versions of sysbench? (I just installed using apt-get). At least I reproduced your glmark2 results.

BENCHMARKS

EDIT: I updated my benchmark to use the same version of sysbench on both setups; this eliminates the discrepancies between the scores except for the I/O, which does seem to be twice as fast on Jetpack 4.2. I also ran two ‘threads’ benchmarks and noticed that with many threads, Jetpack 4.2 is about 4x slower!

Another salient point I noticed that might be of help:
Tensorflow 1.13.1 reports

name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.02

while Tensorflow 1.8 reports

name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005

Meanwhile, tegrastats reports a 1.3 GHz GPU memory clock on both setups.
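Tensorflow takes that number straight from the CUDA device properties, so the driver’s report can be cross-checked directly. A minimal sketch, assuming the CUDA toolkit is installed on the device (this must be built with nvcc on the Jetson; it is not host-runnable):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // memoryClockRate is reported in kHz; Tensorflow's "memoryClockRate(GHz)"
    // is this value divided by 1e6.
    std::printf("name: %s  memoryClockRate: %.4f GHz\n",
                prop.name, prop.memoryClockRate / 1e6);
    return 0;
}
```

Running this under both JetPack versions would show whether the 1.02 vs 1.3005 GHz difference comes from the driver or from Tensorflow itself.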

I used sysbench (installable via apt-get) and glmark2 (the 2017 release).

I tested CUDA performance with some of the CUDA release sample programs
(there is no difference in CUDA performance between JetPack 4.2 and JetPack 3.x),
but the glmark2 results show a large difference.

I know glmark2 is not an official NVIDIA tool, but I could not find another one.

Thanks
Regards

BK

How can I download the new JetPack 4.2?

The most recent JetPack/SDK Manager is listed here:
https://developer.nvidia.com/embedded/jetpack
(you might have to log in there and then hit the URL a second time since redirects don’t work correctly)

I don’t know when, but there will probably be a newer release soon.

To verify the TensorRT speed issues, I have done some more testing on that front. I created this minimal example and ran it on Jetpack 3.2 and 4.2. For reference: the engine is set to kHALF with batch size 16 and the model is a variation on tiny-yolo-v3 network with a 512x288 input.

Jetpack 3.2 took 10.3198ms per image
Jetpack 4.2 took 11.6762ms per image

So in this case it’s a 12% speed decrease. When I alter the example so only the

m_inferNet->doInference(input.data())

part is measured, the results change to:

Jetpack 3.2 took 9.55303ms per image
Jetpack 4.2 took 9.35428ms per image

This confirms the earlier findings that Cuda seems to be just as fast in Jetpack 4.2 (or even slightly faster) while other things get a significant speed decrease. In this case, Nvidia’s own decodeDetections and nonMaximumSupression functions.

Okay after a week of constant frustration I managed to resolve my issues.

Regarding the Tensorflow speed issues, I managed to mitigate them somewhat by compiling Tensorflow 1.8.0 for Python 3.6 myself. This reduces the slowdown from 45% extra runtime to 11% extra runtime, so it is still slower, but less so.

The TensorRT speed issues were not caused by the arithmetic in the decodeDetections and nonMaximumSupression functions, but the memory access. As explained in this thread, cudaMallocHost has changed its behavior in Cuda 10.0 so pinned memory cannot use CPU caching anymore. I resolved this by using cudaMallocManaged instead of cudaMallocHost in this line or the equivalent line in the DeepStream 2.0 repository (make sure to change cudaFreeHost to cudaFree too). I have double checked the output because there is a warning:

Note: The unified memory model requires the driver and system software to manage coherence on the current Tegra SOC. Software managed coherence is by nature non-deterministic and not recommended in a safe context. Zero-copy memory (pinned memory) is preferable in these applications.

and the output is still consistent.
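The change described above can be sketched as follows. This is a minimal illustration only, assuming a generic buffer allocation; `allocHostVisible`, `buf`, and `size` are hypothetical names, not the actual trt-yolo-app code (it must be built with nvcc on the device).

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hedged sketch of the allocation swap described above.
void* allocHostVisible(std::size_t size) {
    void* buf = nullptr;
    // Before (CUDA 10.0 on Tegra maps pinned memory uncached for the CPU):
    //   cudaMallocHost(&buf, size);     // must be freed with cudaFreeHost(buf)
    // After: managed memory keeps CPU caching, with software-managed coherence:
    cudaMallocManaged(&buf, size);       // must be freed with cudaFree(buf)
    return buf;
}
```

Note the matching free call has to change too, as mentioned above: `cudaFreeHost` becomes `cudaFree`.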

Dear kayccc

I checked OpenGL performance with bk1472 using the glmark2 benchmark tool on JetPack 4.2.1 (the new version).
OpenGL performance is still lower than on JetPack 3.2 (Ubuntu 16.04).
I think there is no improvement in OpenGL performance in the new JetPack version (4.2 -> 4.2.1).

How does NVIDIA verify OpenGL performance for a new OS / new JetPack release?
Could you share the method/tool you use to verify OpenGL performance, and the results?

Thanks,
Junsin.

Hi,
Do you run ‘sudo nvpmodel -m 0’ and ‘sudo jetson_clocks’ before running the benchmark?

Dear DaneLLL

Sorry for late response.

We already run ‘sudo nvpmodel -m 0’ and ‘sudo jetson_clocks’ in both cases (16.04_jetpack_3.2 & 18.04_jetpack_4.2.1).

Thanks,
Junsin.

Dear DaneLLL

Is there any performance improvement for the TX2?
I tested a video player (smplayer) with a simple MP4 file on the TX2, and the results differed between Ubuntu 16.04 and 18.04 with JetPack 4.2.1.
Ubuntu 16.04 seemed OK, but 18.04 showed a serious problem: more than 40% CPU usage for the X server process.

This prevents me from moving to Ubuntu 18.04 on the TX2.

Hi,
For video decoding with hardware acceleration, we support GStreamer and tegra_multimedia_api. We suggest you try either solution.

smplayer should be based on ffmpeg. There is a community contribution you may want to try:
https://github.com/jocover/jetson-ffmpeg

We are also trying to include ffmpeg with hardware acceleration in L4T releases. This work is ongoing.