CUDA performance issue on TX2

When we ported our CUDA source code from TX1 to TX2, it behaved strangely.

Since the TX2's GPU has roughly twice the compute capability of the TX1's, we expected the TX2 to be at least 30-40% faster.

Unfortunately, most of our code base took about twice as long on TX2 as on TX1; in other words, TX2 mostly ran at half the TX1's speed. After logging every small step that invoked CUDA APIs, we believe the TX2's CUDA API calls really do compute more slowly than the TX1's in many cases.
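As an aside, instead of hand-logging each step, a profiler run can break the time down by kernel and by CUDA API call. A minimal sketch, assuming the CUDA 7/8 toolchains mentioned below and that the binary takes the same arguments as in this thread:

```shell
# Summarize time spent per kernel and per CUDA API call
nvprof ./sgm ../example/ 0 0

# Optionally trace individual API calls with timestamps
# for a per-step breakdown
nvprof --print-api-trace ./sgm ../example/ 0 0
```

This makes it easy to see whether the extra time is in the kernels themselves or in API overhead such as JIT recompilation at startup.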

Here's a third-party public repo that reproduces this behavior:

Test on both TX1 and TX2 using the example image:
TX1: about 57 ms per frame (17 fps).
TX2: about 258 ms per frame (3 fps).

About 4.5 times slower.

Test environment:
TX1: Ubuntu 14.04, CUDA 7.0.74, normal usage, no special power settings
TX2: Ubuntu 16.04, CUDA 8.0.62, nvpmodel -m 0

Any suggestions on how to improve the TX2's GPU performance?

Thanks

Hi,

The TX2 uses the Pascal architecture, which requires the sm_62 compile option.
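As a sanity check before editing the build files, the device's compute capability can be confirmed with the deviceQuery sample, and the matching -gencode flag passed to nvcc. A sketch, assuming a default CUDA 8 samples install path (adjust as needed):

```shell
# Build and run the deviceQuery sample; on TX2 it should report
# "CUDA Capability Major/Minor version number:    6.2"
# (sudo may be needed if the samples directory is root-owned)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

# Compile explicitly for Pascal (sm_62)
nvcc -O3 -gencode=arch=compute_62,code=sm_62 main.cu -o sgm
```

Without a matching -gencode entry, the driver has to JIT-compile PTX at startup (or the binary may fall back to a suboptimal architecture), which can distort timing results.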

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7e4f3e6..c1d8150 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -15,8 +15,12 @@
 #    You should have received a copy of the GNU General Public License
 #    along with sgm.  If not, see <http://www.gnu.org/licenses/>.
 
+set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
+
 cmake_minimum_required(VERSION 2.4)
+
 project(sgm)
+
 find_package( OpenCV REQUIRED )
 find_package( CUDA REQUIRED )
 
@@ -24,10 +28,7 @@ set(
     CUDA_NVCC_FLAGS
     ${CUDA_NVCC_FLAGS};
     -O3 -lineinfo
-    -gencode=arch=compute_30,code=sm_30
-    -gencode=arch=compute_35,code=sm_35
-    -gencode=arch=compute_50,code=sm_50
-    -gencode=arch=compute_52,code=sm_52
+    -gencode=arch=compute_62,code=sm_62
     )
 
 cuda_add_executable(

nvidia@tegra-ubuntu:~/sgm/build$ ./sgm …/example/ 0 0
It took an average of 33.4728 miliseconds, 29.875 fps

Please add the correct architecture to CMakeLists and let us know the results.
Thanks.

Hi AastaLLL,

I think we tried sm_62 before; the main difference is the following line:

set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)

It still took around 260 ms whether we enabled this line or not.
But once we started using these CMake settings, the running time became unstable from frame to frame: sometimes it reached 13 fps, but mostly it was still 3.8 fps.

Are there any extra compile settings or environment settings we need to apply?

And here is the final CMakeLists.txt we use:

set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
cmake_minimum_required(VERSION 2.4)
project(sgm)
find_package( OpenCV REQUIRED )
find_package( CUDA REQUIRED )

set(
    CUDA_NVCC_FLAGS
    ${CUDA_NVCC_FLAGS};
    -O3 -lineinfo
    -gencode=arch=compute_62,code=sm_62
    )

cuda_add_executable(
    sgm
    main.cu median_filter.cu hamming_cost.cu disparity_method.cu debug.cu costs.cu)

target_link_libraries( sgm ${OpenCV_LIBS} )

Thanks

Hi,

I setup my environment with:

  1. JetPack 3.0
  2. Set nvpmodel to Max-N
  3. Run jetson_clocks.sh
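Steps 2 and 3 above can be applied from a shell. A sketch, assuming JetPack 3.0 on TX2, where the script is placed in the home directory:

```shell
# Max-N power mode on TX2: all CPU cores online,
# maximum CPU/GPU clock caps
sudo nvpmodel -m 0

# Verify the active power mode
sudo nvpmodel -q

# Lock CPU, GPU, and memory clocks to their maximums
# (also turns on the fan)
sudo ~/jetson_clocks.sh
```

Note that nvpmodel only raises the clock *caps*; jetson_clocks.sh pins the clocks at those caps, which removes the frame-to-frame variability caused by dynamic frequency scaling.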

Could you also test with the image included in the sgm GitHub repo and let us know the results?
Thanks.

Thanks, we now get the correct performance. It seems jetson_clocks.sh applies some extra settings beyond starting the fan. Please close this issue, thanks.

I'm familiar with the above codebase, and I got the same results as AastaLLL. The important thing is to change your nvpmodel and run jetson_clocks.

One would think those would be the defaults, both on boot and in the compiler settings.

Hi.

I'd like to thank AastaLLL, since this also solved my problem (running time varying greatly from one run to the next).
I was aware of changing the mode with nvpmodel, but like william_wu, I didn't know jetson_clocks.sh could help. I have never seen this info anywhere else.