Code performs poorly on tx2 compared to tx1

WANG-KX · June 5, 2017, 4:55pm

Hello !

We have encountered a strange problem, our code runs slower on tx2 compared with tx1. As we expected, the performance should be at least as good as on tx1, however it is usually two to three times slower for some programs. The setting for the code is exactly the same with that of tx1, except the -gencode=arch (tx2:62 tx1:53). Before we run on tx2, we also set tx2 environment using:

nvpmodel -m 0
./jetson_clocks.sh

We have packed the code and upload at https://github.com/WANG-KX/tx2_issue, you can downloa
d and run it. Details are also provided in the README file.

Is there any other environment settings? Why the code is so slow on tx2?

Thanks for your help!

Skypuppy · June 5, 2017, 8:07pm

Does the compiler have an -O3 setting? Or is that even valid for the TX series?

WANG-KX · June 6, 2017, 3:18am

Hi SKkypuppy,

Yes, we used -O3 setting when compiling the code. You can download the code https://github.com/WANG-KX/tx2_issue, it works fine on tx1 and titan but tx1.

The main question is that why it is so slow compared with tx1. It is a simple stereo matching code, Details ara:

TX2
the input image size: 1344 x 391. wta cost 41.571000 ms. sgm cost: 514.919000 ms.

TX1
the input image size: 1344 x 391. wta cost 18.380000 ms. sgm cost: 287.463000 ms.

(wta means winner takes all, sgm means semi global matching)

carolyuu · June 6, 2017, 8:36am

Hi WANG-KX,

Could you share the code for TX1?
We can try to repro issue and investigate.

Thanks!

WANG-KX · June 6, 2017, 9:04am

Hi carolyuu,

The code uploaded athttps://github.com/WANG-KX/tx2_issue is used at TX1, TX2 and TITAN XP, only the CMakeLists.txt is changed according to the arch of the respected device.

It is quite simple to compile and run.
just:

git clone https://github.com/WANG-KX/tx2_issue.git
cd tx2_issue
mkdir build
cd build
cmake ..
make
./sgm

do not froget to change the CMakeLists.txt from line 15 to 17 according to your device

Thanks for your help! Let me know if there is any problems or suggestions.

WayneWWW · June 6, 2017, 9:44am

Hi WANG-KX,

Please do not use clock() for profiling.

The elapsed time return from clock() is not correct on TX1.

Please refer to this topic:
https://devtalk.nvidia.com/default/topic/1000705/jetson-tx2/why-does-jetson-tx1-outperform-tx2-on-cufft

WANG-KX · June 6, 2017, 10:29am

Hi WayneWWW,

Thanks for your suggestion! I change the time measurement into gettimeofday and get consist time cost according to different device.
For reference:

TITAN XP the input image size: 1344 x 391. wta cost 4.095000 ms. sgm cost: 37.245998 ms.

TX2 the input image size: 1344 x 391. wta cost: 43.356998 ms. sgm cost: 515.893005 ms.

TX1 the input image size: 1344 x 391. wta cost 49.083000 ms. sgm cost: 671.835999 ms.

Does this seem right with respect to the performance improvement?

snarky · June 6, 2017, 4:56pm

I’ve found the best clock for profiling on linux is CLOCK_MONOTONIC_RAW.

#include <time.h>

struct timespec tspec;
clock_gettime(CLOCK_MONOTONIC_RAW, &tspec);
double secondsTimestamp = tspec.tv_nsec * 1e-9 + tspec.tv_sec;

Topic		Replies	Views
CUDA performance issue on tx2 Jetson TX2	8	4260	October 18, 2021
Code execution slower after flashing jetson TX2 with jetpack l4t3.1 Jetson TX2	8	1246	October 18, 2021
Jetson TX1 poor performance relative to TX2 Jetson TX1	6	698	October 18, 2021
confused,Our programs run on TX1 is slower than TK1. Jetson TX1	9	1159	October 18, 2021
TX1 slower than TK1 Jetson TX1	5	1314	August 19, 2016
TX2 running slow Jetson TX2	16	4141	October 18, 2021
Question about timing of TX2 Jetson TX2 cuda	2	378	October 18, 2021
TX2_ Why is the performance of NX lower than that of Xavier NX Jetson TX2 performance	11	1379	October 18, 2021
Opencv code 4x slower on tx1 over tk1 Jetson TX1	1	729	June 21, 2016
cuda question on tx2 Jetson TX2	8	1244	October 18, 2021

Code performs poorly on tx2 compared to tx1

Related topics