CUDA performance issue on TX2

When we ported our CUDA source code from TX1 to TX2, it behaved strangely.

Since the TX2's GPU has roughly twice the compute capability of the TX1's, we expected the TX2 to be at least 30-40% faster.

Unfortunately, most of our code base took about twice as long on TX2 as on TX1; in other words, TX2 mostly ran at half the TX1's speed. After logging every small step that invoked CUDA APIs, we believe the TX2's CUDA API calls really do compute more slowly than the TX1's in many cases.
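As an aside, instead of hand-logging each step, a profiler run can break the time down by kernel and by CUDA API call. A minimal sketch, assuming the CUDA 7/8 toolchains mentioned below and that the binary takes the same arguments as in this thread:

```shell
# Summarize time spent per kernel and per CUDA API call
nvprof ./sgm ../example/ 0 0

# Optionally trace individual API calls with timestamps
# for a per-step breakdown
nvprof --print-api-trace ./sgm ../example/ 0 0
```

This makes it easy to see whether the extra time is in the kernels themselves or in API overhead such as JIT recompilation at startup.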

Here's a third-party public repo that reproduces this behavior:

Test on both TX1 and TX2 using the example image:
TX1: about 57 ms per frame (17 fps).
TX2: about 258 ms per frame (3 fps).

About 4.5 times slower.

Test environment:
TX1: Ubuntu 14.04, CUDA 7.0.74, normal usage, no special power settings
TX2: Ubuntu 16.04, CUDA 8.0.62, nvpmodel -m 0

Any suggestions on how to improve the TX2's GPU performance?

Thanks

Hi,

The TX2 uses the Pascal architecture, which requires the sm_62 compile option.
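As a sanity check before editing the build files, the device's compute capability can be confirmed with the deviceQuery sample, and the matching -gencode flag passed to nvcc. A sketch, assuming a default CUDA 8 samples install path (adjust as needed):

```shell
# Build and run the deviceQuery sample; on TX2 it should report
# "CUDA Capability Major/Minor version number:    6.2"
# (sudo may be needed if the samples directory is root-owned)
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

# Compile explicitly for Pascal (sm_62)
nvcc -O3 -gencode=arch=compute_62,code=sm_62 main.cu -o sgm
```

Without a matching -gencode entry, the driver has to JIT-compile PTX at startup (or the binary may fall back to a suboptimal architecture), which can distort timing results.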

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7e4f3e6..c1d8150 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -15,8 +15,12 @@
 #    You should have received a copy of the GNU General Public License
 #    along with sgm.  If not, see <http://www.gnu.org/licenses/>.
 
+set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
+
 cmake_minimum_required(VERSION 2.4)
+
 project(sgm)
+
 find_package( OpenCV REQUIRED )
 find_package( CUDA REQUIRED )
 
@@ -24,10 +28,7 @@ set(
     CUDA_NVCC_FLAGS
     ${CUDA_NVCC_FLAGS};
     -O3 -lineinfo
-    -gencode=arch=compute_30,code=sm_30
-    -gencode=arch=compute_35,code=sm_35
-    -gencode=arch=compute_50,code=sm_50
-    -gencode=arch=compute_52,code=sm_52
+    -gencode=arch=compute_62,code=sm_62
     )
 
 cuda_add_executable(

nvidia@tegra-ubuntu:~/sgm/build$ ./sgm …/example/ 0 0
It took an average of 33.4728 miliseconds, 29.875 fps

Please add the correct architecture to CMakeLists and let us know the results.
Thanks.

Hi AastaLLL,

I think we tried sm_62 before; the main difference is the following line:

set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)

It still took around 260 ms whether we enabled this line or not.
But once we started using these CMake settings, the running time became unstable from frame to frame: sometimes it reached 13 fps, but mostly it was still 3.8 fps.

Are there any extra compile settings or environment settings we need to apply?

And here is the final CMakeLists.txt we use:

set(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
cmake_minimum_required(VERSION 2.4)
project(sgm)
find_package( OpenCV REQUIRED )
find_package( CUDA REQUIRED )

set(
    CUDA_NVCC_FLAGS
    ${CUDA_NVCC_FLAGS};
    -O3 -lineinfo
    -gencode=arch=compute_62,code=sm_62
    )

cuda_add_executable(
    sgm
    main.cu median_filter.cu hamming_cost.cu disparity_method.cu debug.cu costs.cu)

target_link_libraries( sgm ${OpenCV_LIBS} )

Thanks

Hi,

I setup my environment with:

  1. JetPack 3.0
  2. Set nvpmodel to Max-N
  3. Run jetson_clocks.sh
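Steps 2 and 3 above can be applied from a shell. A sketch, assuming JetPack 3.0 on TX2, where the script is placed in the home directory:

```shell
# Max-N power mode on TX2: all CPU cores online,
# maximum CPU/GPU clock caps
sudo nvpmodel -m 0

# Verify the active power mode
sudo nvpmodel -q

# Lock CPU, GPU, and memory clocks to their maximums
# (also turns on the fan)
sudo ~/jetson_clocks.sh
```

Note that nvpmodel only raises the clock *caps*; jetson_clocks.sh pins the clocks at those caps, which removes the frame-to-frame variability caused by dynamic frequency scaling.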

Could you also test with the image included in the sgm GitHub repo and let us know the results?
Thanks.

Thanks, we now get the correct performance. It seems jetson_clocks.sh applies some extra settings beyond starting the fan. Please close this issue, thanks.

I'm familiar with the above codebase, and I got the same results as AastaLLL. The important thing is to change your nvpmodel and run jetson_clocks.

One would think those would be the defaults, both on boot and in the compiler settings.

Hi.

I'd like to thank AastaLLL, since this also solved my problem (running time varying greatly from one run to the next).
I was aware of changing the mode with nvpmodel, but like william_wu, I didn't know jetson_clocks.sh could help. I have never seen this info anywhere else.