Code performs poorly on TX2 compared to TX1

Hello !

We have encountered a strange problem: our code runs slower on the TX2 than on the TX1. We expected performance to be at least as good as on the TX1, but for some programs it is two to three times slower. The build settings are exactly the same as on the TX1, except for -gencode arch (TX2: 62, TX1: 53). Before running on the TX2, we also set the performance mode using:

nvpmodel -m 0

We have packed the code and uploaded it at , so you can download and run it. Details are also provided in the README file.

Are there any other environment settings we should apply? Why is the code so slow on the TX2?

Thanks for your help!

Does the compiler have an -O3 setting? Or is that even valid for the TX series?

Hi SKkypuppy,

Yes, we used the -O3 setting when compiling the code. You can download the code; it works fine on the TX1 and TITAN but not on the TX2.

The main question is why it is so slow compared with the TX1. It is a simple stereo matching code. Details are:

the input image size: 1344 x 391. wta cost 41.571000 ms. sgm cost: 514.919000 ms.

the input image size: 1344 x 391. wta cost 18.380000 ms. sgm cost: 287.463000 ms.

(wta means winner-takes-all, sgm means semi-global matching)


Could you share the code for TX1?
We can try to reproduce the issue and investigate.


Hi carolyuu,

The code uploaded at is used on the TX1, TX2, and TITAN XP; only the CMakeLists.txt is changed according to the arch of the respective device.

It is quite simple to compile and run.

git clone
cd tx2_issue
mkdir build
cd build
cmake ..
make

Do not forget to change lines 15 to 17 of the CMakeLists.txt according to your device.
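For reference, those lines typically hold per-device -gencode flags along these lines (the variable name and layout here are illustrative, since I don't have the actual CMakeLists.txt in front of me; the compute capabilities themselves are standard: sm_53 for TX1, sm_62 for TX2, sm_61 for TITAN XP):

```cmake
# Hypothetical lines 15-17: uncomment the -gencode for your device.
# set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode arch=compute_53,code=sm_53)  # TX1
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode arch=compute_62,code=sm_62)    # TX2
# set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode arch=compute_61,code=sm_61)  # TITAN XP
```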

Thanks for your help! Let me know if there are any problems or suggestions.


Please do not use clock() for profiling.

The elapsed time returned from clock() is not correct on the TX1.

Please refer to this topic:

Hi WayneWWW,

Thanks for your suggestion! I changed the time measurement to gettimeofday() and now get consistent timings across the different devices.
For reference:

TITAN XP: input image size 1344 x 391. wta cost: 4.095000 ms. sgm cost: 37.245998 ms.

TX2: input image size 1344 x 391. wta cost: 43.356998 ms. sgm cost: 515.893005 ms.

TX1: input image size 1344 x 391. wta cost: 49.083000 ms. sgm cost: 671.835999 ms.

Does this seem right with respect to the performance improvement?

I’ve found the best clock for profiling on Linux is CLOCK_MONOTONIC_RAW.

#include <time.h>

struct timespec tspec;
clock_gettime(CLOCK_MONOTONIC_RAW, &tspec);
/* tv_sec holds whole seconds, tv_nsec the nanosecond remainder */
double secondsTimestamp = tspec.tv_nsec * 1e-9 + tspec.tv_sec;