TX1 slower than TK1

I followed this guide to maximize the performance of my TX1.

http://elinux.org/Jetson/TX1_Controlling_Performance

I followed this one for the TK1.

http://elinux.org/Jetson/Performance

I was surprised when I found the TK1 processed 20 frames per second while the TX1 only ran at 13-15. I should also mention that both are running ROS Indigo if that matters. Also, I noticed in the guide for the TX1 the article mentions

“There are many cores not mentioned above, exposed through /sys/kernel/debug and /sys/devices/system that can be experimented with for impacts to power scaling.”

What does this mean? I can’t test this right now because I’m away from the lab at the moment.

Really could use some help on this.

Hi julmp_N,

“I was surprised when I found the TK1 processed 20 frames per second while the TX1 only ran at 13-15.”

What’s the benchmark tool or application you tested?

Sorry, forgot to mention it is an OpenCV program that uses the gpu module's cvtColor to convert BGR to Luv.

I’m measuring the program’s speed using the time() function found in <time.h>

http://pastebin.com/cGLjKHnt

It is noticeably slower on the tx1 vs the tk1 (same code, compile command, and images).

Hi julmp_N,

When we benchmark TK1 vs TX1, we have seen degradation in some cases and improvement in others, depending on how deeply the software pipeline handles synchronization.

To confirm whether this is the problem, look at the CUDA profiler log from both boards and compare the time spent on the GPU. GPU execution on the TX1 should be faster; if it is not, the slowdown is caused by GPU architecture changes and the code is probably sub-optimal for it. If GPU execution on the TX1 is faster, you can improve the pipeline through better use of streams and synchronization.
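One straightforward way to get that per-board comparison (assuming the CUDA toolkit's profiler is installed on both boards) is to run the same binary under nvprof and compare the GPU-side summaries; the binary name below is a placeholder:

```shell
# Run the identical binary on the TK1 and the TX1 under the CUDA profiler.
# "cvtcolor_bench" is a hypothetical name -- substitute your program.
nvprof ./cvtcolor_bench

# The summary breaks time down per kernel and per memcpy; compare the
# cvtColor kernel time and the HtoD/DtoH memcpy time between boards to
# see whether the kernel or the copies account for the slowdown.
```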

Hope this helps in your case.

Thanks

https://devtalk.nvidia.com/default/topic/894945/jetson-tx1/jetson-tx1/post/4950721/#4950721

When I checked the logs, the TX1 was definitely slower. I’m also running ROS on my TX1, but I don’t think that should change anything…

I’m looking around for examples on how to use the ZERO COPY (ALLOC_ZEROCOPY) flag as mentioned in the link, but I’m really struggling to find examples related to this.

The code that uses the GPU looks like this, and I make sure to run the max-performance script on the TX1 as well.

[code]
cv::gpu::GpuMat foo;
foo.upload(inputimage);
// do stuff
[/code]

The following documentation includes the functions for allocating an OpenCV CudaMem object with the ALLOC_ZEROCOPY flag and then obtaining a GpuMat from it: http://docs.opencv.org/2.4/modules/gpu/doc/data_structures.html#gpu-cudamem

Once you have the GpuMat, you can use it like normal, except you shouldn’t need to do the redundant upload/download copies anymore.
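Putting those pieces together, a rough sketch of the zero-copy pattern with the OpenCV 2.4 gpu module might look like the following (the image size is assumed for illustration, and you would fill the source Mat from your own input image):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main() {
    // Zero-copy only makes sense where the GPU can map host memory;
    // on Tegra boards like the TK1/TX1 the CPU and GPU share DRAM.
    if (!cv::gpu::CudaMem::canMapHostMemory())
        return 1;

    const int rows = 480, cols = 640;  // assumed image size for this sketch

    // Page-locked host buffers that the GPU can address directly.
    cv::gpu::CudaMem srcMem(rows, cols, CV_8UC3, cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    cv::gpu::CudaMem dstMem(rows, cols, CV_8UC3, cv::gpu::CudaMem::ALLOC_ZEROCOPY);

    // CPU-side views: copy your input into srcHost (e.g. inputimage.copyTo(srcHost)).
    cv::Mat srcHost = srcMem.createMatHeader();
    cv::Mat dstHost = dstMem.createMatHeader();

    // GPU-side views over the same memory -- no upload()/download() copies.
    cv::gpu::GpuMat srcDev = srcMem.createGpuMatHeader();
    cv::gpu::GpuMat dstDev = dstMem.createGpuMatHeader();

    cv::gpu::cvtColor(srcDev, dstDev, CV_BGR2Luv);

    // dstHost now aliases the result; use it directly on the CPU.
    return 0;
}
```

The key point is that the GpuMat headers alias the CudaMem allocation, so the explicit foo.upload(inputimage) step from the earlier snippet disappears entirely.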