Why is there no performance difference between the TX2 and Xavier? (deep learning speed)

I recently bought a Xavier expecting faster processing times to test.
But there is no big difference from the TX2 I bought a few months ago; the Xavier is even slower.
Testing one image (using TensorFlow) takes 19 ms on the Xavier, while it takes 14 ms on the TX2 (both under the nvpmodel mode 0 condition, well known for the best clock frequencies).
I have seen claims that the Xavier performs up to 20 times better than the TX2 in deep learning, and considering the number of GPU cores, the Xavier should be at least twice as fast.
Please give me advice on how to deal with this speed problem, and tell me why this happens.
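One thing worth checking before comparing the two boards: a single-image timing is dominated by one-time costs (CUDA context creation, cuDNN autotuning, memory allocation), so 19 ms vs. 14 ms on one image may not reflect steady-state throughput. Below is a minimal, hedged sketch of a fairer timing harness; `infer` here is a stand-in for whatever your actual inference call is (e.g. a `sess.run(...)` in TensorFlow):

```python
import time

def benchmark(infer, warmup=10, iters=100):
    """Time an inference callable, discarding warm-up iterations.

    The first few calls typically include CUDA context creation,
    cuDNN autotuning, and memory allocation, so timing a single
    image mostly measures that one-time cost instead of the model.
    """
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    return elapsed / iters  # mean seconds per call

# Usage with a dummy CPU workload; replace the lambda with your
# real inference call on the device.
mean_s = benchmark(lambda: sum(i * i for i in range(10000)))
print(f"{mean_s * 1e3:.3f} ms per call")
```

If the per-call time after warm-up is still comparable on both boards, the bottleneck is probably not the GPU itself.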


It’s recommended to use TensorRT; you will get a much better improvement with it.


While I think TensorRT will increase your speed (well, it should), I don’t see why “naively” running your previous code base should result in poor performance. Would you mind sharing more details on what you’re trying to run? I’m quite curious. I’ve tried running some of my TX2-related deep learning networks, and while I didn’t see a 20x speedup, I did notice roughly a 1.5x speedup on all networks straight out of the box.

Hi, I have also come across the same situation. ShuffleSeq runs only 1 fps faster than on the TX2, and sometimes even slower. Native YOLOv2 could not detect anything, and its speed is only 3 fps. Since porting to TensorRT is a pain, could anyone provide a clue on how to improve performance on the Xavier? At least the extra CUDA cores should be beneficial.

Hi, all

Would you mind sharing the model or sample you are working on?
We want to check it in detail and feed this issue back to our internal team.


The default mode of the Xavier is a TX2 simulation mode (nvpmodel 2).

Besides that, enabling the two NVDLA modules is essential for high performance (how do you verify they are on, or turn them on?).
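For reference, here is a hedged sketch of how to check and change the power mode on the device. These are device-specific configuration commands, so they only run on the Xavier itself; on older L4T releases the clock script ships as `~/jetson_clocks.sh` rather than a system-wide `jetson_clocks` command.

```shell
# Query the current power mode
sudo nvpmodel -q --verbose

# Switch to MAXN (mode 0): all CPU cores enabled, highest clock caps
sudo nvpmodel -m 0

# Pin clocks to their maximum for benchmarking
# (older L4T: sudo ~/jetson_clocks.sh)
sudo jetson_clocks
```

Benchmarking in the default mode against a TX2 running at MAXN is not a like-for-like comparison.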

You can only use the NVDLA units by using TensorRT. Thus, if you want to run Yolo, you have to port it to using TensorRT. (And then make sure it runs in fp16 or int8 mode!)
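As an illustration of targeting the DLA from TensorRT, here is a hedged `trtexec` invocation. `model.onnx` is a placeholder for your exported network, and the exact flag names vary across JetPack/TensorRT versions, so verify them with `trtexec --help` on your release:

```shell
# Build and time an engine on DLA core 0 in FP16, falling back to
# the GPU for any layers the DLA cannot run.
trtexec --onnx=model.onnx \
        --fp16 \
        --useDLACore=0 \
        --allowGPUFallback
```

Layers unsupported by the DLA fall back to the GPU, so a network with many unsupported layers may see little benefit from the DLA cores.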

Yes, the statement “The default mode is 30W (id:5)” in the table in https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3102/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0OL0HA will be corrected in the next doc release.

I have a similar problem. The same object detection script was run on both the Xavier and the TX2: the Xavier processed at 34 fps and the TX2 at 27 fps. The improvement is not significant. TensorRT is not applicable here due to custom layers.

Then you need to profile your application and figure out where it spends its time, and how to improve that bottleneck.
Maybe there isn’t enough parallelism, so you can’t actually use the additional CUDA cores available in the Xavier GPU, for example?
Maybe you’re not even using CUDA? It’s hard to know before you have solid data.
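To make that concrete, here is a minimal sketch of finding a Python-side hotspot with the standard-library profiler. The `preprocess` and `detect` functions are hypothetical stand-ins for your pipeline stages; GPU kernel time would need `nvprof` or Nsight instead, since `cProfile` only sees host code:

```python
import cProfile
import io
import pstats

def preprocess(frames):
    # Stand-in for CPU-side work (resize, normalization, NMS, ...)
    return [sum(f) / len(f) for f in frames]

def detect(frames):
    # Stand-in for the model call; replace with your inference step
    return [x * 2 for x in preprocess(frames)]

profiler = cProfile.Profile()
profiler.enable()
detect([list(range(1000))] * 200)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # top 5 functions by cumulative time
```

If the report shows most time in CPU-side pre/post-processing, a faster GPU cannot help, which would explain near-identical numbers on both boards.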