I was surprised to find that the TK1 processed 20 frames per second while the TX1 only ran at 13-15. I should also mention that both are running ROS Indigo, if that matters. Also, I noticed that the guide for the TX1 mentions:
“There are many cores not mentioned above, exposed through /sys/kernel/debug and /sys/devices/system that can be experimented with for impacts to power scaling.”
What does this mean? I can’t test this right now because I’m away from the lab at the moment.
When we benchmarked the TK1 against the TX1, we saw degradation in some cases and improvement in others, caused by synchronization deep in the software architecture.
To confirm whether this is the problem, you can look at the CUDA profile logs for the run on both boards and compare the time spent on the GPU. The TX1 should be faster; if it is not, the slowdown is caused by GPU architecture changes, and the code is probably sub-optimal. If GPU execution on the TX1 is faster, you can improve the pipeline by making better use of streaming and synchronization.
When I checked the logs, the TX1 was definitely slower. I’m also running ROS on my TX1, but I don’t think that should change anything…
I’m looking around for examples on how to use the ZERO COPY (ALLOC_ZEROCOPY) flag as mentioned in the link, but I’m really struggling to find examples related to this.
The code that uses the GPU looks like this, and I make sure to run the max performance script for the TX1 as well.
cv::gpu::GpuMat foo;
foo.upload(inputimage);
// do stuff
The following documentation covers the functions for allocating an OpenCV CudaMem object with the ALLOC_ZEROCOPY flag and then obtaining a GpuMat from it: Data Structures — OpenCV 2.4.13.7 documentation
Once you have the GpuMat, you can use it like normal, except you shouldn’t need to do the redundant upload/download copies anymore.