TX1 slower than TK1

I followed this guide to maximize the performance of my TX1.

http://elinux.org/Jetson/TX1_Controlling_Performance

I followed this one for the TK1.

http://elinux.org/Jetson/Performance

I was surprised when I found the TK1 processed 20 frames per second while the TX1 only ran at 13-15. I should also mention that both are running ROS Indigo if that matters. Also, I noticed in the guide for the TX1 the article mentions

“There are many cores not mentioned above, exposed through /sys/kernel/debug and /sys/devices/system that can be experimented with for impacts to power scaling.”

What does this mean? I can’t test this right now because I’m away from the lab at the moment.

Really could use some help on this.

Hi julmp_N,

“I was surprised when I found the TK1 processed 20 frames per second while the TX1 only ran at 13-15.”

What’s the benchmark tool or application you tested?

Sorry, forgot to mention it is an OpenCV program that uses the gpu module's cvtColor to convert BGR to Luv.

I’m measuring the program’s speed using the time() function found in <time.h>

http://pastebin.com/cGLjKHnt

It is noticeably slower on the tx1 vs the tk1 (same code, compile command, and images).

Hi julmp_N,

When we benchmark TK1 vs TX1, we have seen degradation in some cases and improvement in others, depending on how deeply the software pipeline handles synchronization.

To confirm whether this is the problem, look at the CUDA profiler log from both boards and compare the time spent on the GPU. GPU execution on the TX1 should be faster; if it is not, the slowdown is caused by GPU architecture changes and the code is probably sub-optimal for it. If GPU execution on the TX1 is faster, you can improve the pipeline through better use of streams and synchronization.
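One straightforward way to get that per-board comparison (assuming the CUDA toolkit's profiler is installed on both boards) is to run the same binary under nvprof and compare the GPU-side summaries; the binary name below is a placeholder:

```shell
# Run the identical binary on the TK1 and the TX1 under the CUDA profiler.
# "cvtcolor_bench" is a hypothetical name -- substitute your program.
nvprof ./cvtcolor_bench

# The summary breaks time down per kernel and per memcpy; compare the
# cvtColor kernel time and the HtoD/DtoH memcpy time between boards to
# see whether the kernel or the copies account for the slowdown.
```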

Hope this helps in your case.

Thanks

https://devtalk.nvidia.com/default/topic/894945/jetson-tx1/jetson-tx1/post/4950721/#4950721

When I checked the logs, the TX1 was definitely slower. I’m also running ROS on my TX1, but I don’t think that should change anything…

I’m looking around for examples on how to use the ZERO COPY (ALLOC_ZEROCOPY) flag as mentioned in the link, but I’m really struggling to find examples related to this.

The code that uses the GPU looks like this, and I make sure to run the max-performance script on the TX1 as well.

[code]
cv::gpu::GpuMat foo;
foo.upload(inputimage);
// do stuff
[/code]

The following documentation includes the functions for allocating an OpenCV CudaMem object with the ALLOC_ZEROCOPY flag and then obtaining a GpuMat from it: http://docs.opencv.org/2.4/modules/gpu/doc/data_structures.html#gpu-cudamem

Once you have the GpuMat, you can use it like normal, except you shouldn’t need to do the redundant upload/download copies anymore.
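Putting those pieces together, a rough sketch of the zero-copy pattern with the OpenCV 2.4 gpu module might look like the following (the image size is assumed for illustration, and you would fill the source Mat from your own input image):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main() {
    // Zero-copy only makes sense where the GPU can map host memory;
    // on Tegra boards like the TK1/TX1 the CPU and GPU share DRAM.
    if (!cv::gpu::CudaMem::canMapHostMemory())
        return 1;

    const int rows = 480, cols = 640;  // assumed image size for this sketch

    // Page-locked host buffers that the GPU can address directly.
    cv::gpu::CudaMem srcMem(rows, cols, CV_8UC3, cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    cv::gpu::CudaMem dstMem(rows, cols, CV_8UC3, cv::gpu::CudaMem::ALLOC_ZEROCOPY);

    // CPU-side views: copy your input into srcHost (e.g. inputimage.copyTo(srcHost)).
    cv::Mat srcHost = srcMem.createMatHeader();
    cv::Mat dstHost = dstMem.createMatHeader();

    // GPU-side views over the same memory -- no upload()/download() copies.
    cv::gpu::GpuMat srcDev = srcMem.createGpuMatHeader();
    cv::gpu::GpuMat dstDev = dstMem.createGpuMatHeader();

    cv::gpu::cvtColor(srcDev, dstDev, CV_BGR2Luv);

    // dstHost now aliases the result; use it directly on the CPU.
    return 0;
}
```

The key point is that the GpuMat headers alias the CudaMem allocation, so the explicit foo.upload(inputimage) step from the earlier snippet disappears entirely.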