I would also be very interested in performance improvements that bring L4T 32.1 with CUDA 10.0 on par with L4T 28.2.1 with CUDA 9.0. I just updated to the new JetPack 4.2 and noticed the following performance degradations:
- Running a custom frozen TensorFlow model (without TF-TRT or anything) consisting of a combined CNN and RNN went from 100 iterations in 7.32 s to 10.64 s, a 45% increase in runtime. Note that the L4T 28.2.1 setup uses TensorFlow 1.8 and the L4T 32.1 setup uses TensorFlow 1.13.1.
- Running trt-yolo-app (so basically a TensorRT engine) wrapped in pybind11 went from 100 batches in 16.55 s to 24.28 s, a 47% increase in runtime. However, the default trt-yolo-app DS2 application stayed constant in speed, at 11.58 ms vs. 11.61 ms per image; I noticed that there, only the inference itself is timed.
This looks to be consistent with the findings of bk1472: CUDA performance seems to be fine, but the other parts of the Jetson TX2 system got slower with the new release. So much so that the slowdown makes JetPack 4.2 unusable for me.
bk1472, could you tell me which tools you used to measure CPU, memory, and disk performance? I would like to reproduce your test results.
I also did some more benchmarks of my own; see this image. The results seem all over the place to me. Maybe it’s due to different versions of sysbench? (I just installed it using apt-get.) At least I reproduced your glmark2 results.
EDIT: I updated my benchmark to use the same version of sysbench on both setups. This eliminates the discrepancies between the scores, except for I/O, which does seem to be twice as fast on JetPack 4.2. I also ran two ‘threads’ benchmarks and noticed that with many threads, JetPack 4.2 is about 4x slower!
Another salient point I noticed that might be of help:
TensorFlow 1.13.1 reports:
I used sysbench (you can install it with apt-get) and glmark2 (the 2017 release).
I tested CUDA performance with some of the CUDA release sample programs
(there is no difference in CUDA performance between JetPack 4.2 and JetPack 3.x),
but the glmark2 results show a big difference.
I know glmark2 is not an official NVIDIA tool, but I could not find another one.
The most recent JetPack/SDK Manager is listed here:
https://developer.nvidia.com/embedded/jetpack
(you might have to log in there and then hit the URL a second time since redirects don’t work correctly)
I don’t know when, but there will probably be a newer release soon.
To verify the TensorRT speed issues, I have done some more testing on that front. I created this minimal example and ran it on JetPack 3.2 and 4.2. For reference: the engine is built in kHALF (FP16) mode with batch size 16, and the model is a variation on the tiny-yolo-v3 network with a 512x288 input.
JetPack 3.2 took 10.3198 ms per image
JetPack 4.2 took 11.6762 ms per image
So in this case it’s a 12% speed decrease. When I alter the example so that only the
m_inferNet->doInference(input.data())
part is measured, the results change to:
JetPack 3.2 took 9.55303 ms per image
JetPack 4.2 took 9.35428 ms per image
This confirms the earlier findings that CUDA seems to be just as fast on JetPack 4.2 (or even slightly faster), while other parts suffer a significant slowdown: in this case, NVIDIA’s own decodeDetections and nonMaximumSupression functions.
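To make the two timing scopes concrete, here is a minimal self-contained sketch. doInference() and postProcess() are stand-ins for m_inferNet->doInference(), decodeDetections and nonMaximumSupression from the trt-yolo-app code; the batch counts are illustrative, not the exact values from my runs.

```cpp
#include <chrono>
#include <cstdio>

static void doInference() { /* GPU work: TensorRT execute + memcpies */ }
static void postProcess() { /* CPU work: decode detections, NMS      */ }

int main()
{
    using Clock = std::chrono::steady_clock;
    const int numBatches = 100, batchSize = 16;

    // Scope 1: time the whole loop body, as in the first measurement
    // (10.3198 vs 11.6762 ms per image).
    auto t0 = Clock::now();
    for (int i = 0; i < numBatches; ++i) { doInference(); postProcess(); }
    auto t1 = Clock::now();

    // Scope 2: time only the inference call, as in the second measurement
    // (9.55303 vs 9.35428 ms per image).
    auto t2 = Clock::now();
    for (int i = 0; i < numBatches; ++i) { doInference(); }
    auto t3 = Clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("pipeline : %.4f ms/image\n", ms(t0, t1) / (numBatches * batchSize));
    std::printf("inference: %.4f ms/image\n", ms(t2, t3) / (numBatches * batchSize));
    return 0;
}
```

The gap between the two scopes isolates the CPU-side post-processing, which is exactly where the regression shows up.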
Okay, after a week of constant frustration, I managed to resolve my issues.
Regarding the TensorFlow speed issues: I managed to alleviate them somewhat by compiling TensorFlow 1.8.0 for Python 3.6 myself. This reduces the slowdown from 45% extra runtime to 11% extra runtime, so it is still slower, but less so.
The TensorRT speed issues were not caused by the arithmetic in the decodeDetections and nonMaximumSupression functions, but by the memory access. As explained in this thread, cudaMallocHost changed its behavior in CUDA 10.0, so pinned memory can no longer use CPU caching. I resolved this by using cudaMallocManaged instead of cudaMallocHost in this line, or the equivalent line in the DeepStream 2.0 repository (make sure to change cudaFreeHost to cudaFree too). I have double-checked the output, because there is a warning:
Note: The unified memory model requires the driver and system software to manage coherence on the current Tegra SOC. Software managed coherence is by nature non-deterministic and not recommended in a safe context. Zero-copy memory (pinned memory) is preferable in these applications.
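For reference, a minimal sketch of the allocation swap, assuming a host-visible buffer like the network output that the post-processing reads; the buffer name and size here are illustrative, not the exact lines from the repository:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Illustrative buffer: in the real code this holds the network output
    // that decodeDetections/nonMaximumSupression read on the CPU.
    const size_t bytes = 16 * (1 << 16) * sizeof(float);
    float* buf = nullptr;

    // Before (fast on CUDA 9.0 / JetPack 3.x): pinned host memory. On
    // CUDA 10.0 on Tegra the same call returns uncached memory, making
    // CPU-heavy post-processing over it much slower.
    // cudaMallocHost(reinterpret_cast<void**>(&buf), bytes);
    // ... cudaFreeHost(buf);

    // After: managed memory, which the CPU can cache again. Verify the
    // outputs, since software-managed coherence on Tegra is
    // non-deterministic (see the warning quoted above).
    if (cudaMallocManaged(reinterpret_cast<void**>(&buf), bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    // ... write from the GPU (kernel or cudaMemcpy), then read on the CPU ...
    cudaFree(buf);
    return 0;
}
```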
I checked OpenGL performance with the glmark2 benchmark tool on JetPack 4.2.1 (the new version), as bk1472 did.
But OpenGL performance is still lower than on JetPack 3.2 (Ubuntu 16.04).
I think there is no improvement in OpenGL performance in the new JetPack version (4.2 → 4.2.1).
How does NVIDIA verify OpenGL performance for a new OS / new JetPack release?
Could you share your method/tool for verifying OpenGL performance, and the results?
Is there any performance improvement for the TX2?
I tested a video player (smplayer) with a simple MP4 file on the TX2, and the results differed between Ubuntu 16.04 and Ubuntu 18.04 with JetPack 4.2.1.
Ubuntu 16.04 seemed OK, but 18.04 showed a serious problem: more than 40% CPU usage for the X server process.