I have compared the performance of TK1 vs TX1. I ran some image processing on images of different sizes (1920 x 184, 1920 x 300, and 1920 x 1200 pixels). The results:
36 ms for 1200 px
12 ms for 300 px
8 ms for 184 px
41 ms for 1200 px
13 ms for 300 px
9 ms for 184 px
Why is there almost no difference for small images? The warp size is the same (although I am not writing CUDA kernels myself, I am using OpenCV4Tegra).
It’s quite a complicated problem to compare performance.
First, I’m not sure what kind of algorithm you are running. Is GPU acceleration applied?
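Generally, GPU acceleration carries some fixed overhead (for example, moving data to and from the GPU and launching the work). For smaller images that overhead makes up a larger share of the total time, so the acceleration is less obvious than for bigger images.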
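To make the overhead argument concrete, here is a rough sketch that splits the timings posted above into a fixed part and a per-row part. It assumes the run time grows linearly with the number of image rows, which is only an approximation, and it labels the two result sets "first" and "second" in the order they were posted. The helper name fit_line is made up for this example.

```python
def fit_line(small, large):
    """Fit t = fixed + per_row * rows through two (rows, ms) measurements."""
    (rows0, ms0), (rows1, ms1) = small, large
    per_row = (ms1 - ms0) / (rows1 - rows0)
    fixed = ms0 - per_row * rows0
    return fixed, per_row


# Timings from the first post, keyed by image height in pixels (values in ms).
first_set = {184: 8.0, 300: 12.0, 1200: 36.0}
second_set = {184: 9.0, 300: 13.0, 1200: 41.0}

for label, times in (("first", first_set), ("second", second_set)):
    # Assumption: a straight line through the smallest and largest image.
    fixed, per_row = fit_line((184, times[184]), (1200, times[1200]))
    share = 100.0 * fixed / times[184]
    predicted_300 = fixed + per_row * 300
    print(f"{label} set: fixed ~{fixed:.1f} ms, "
          f"{per_row * 1000:.1f} us per row, "
          f"fixed share at 184 rows ~{share:.0f}%, "
          f"predicted {predicted_300:.1f} ms at 300 rows "
          f"(measured {times[300]:.0f} ms)")
```

Under this assumed model, the part of the time that does not depend on image size comes out at roughly 3 ms for both result sets, which is more than a third of the total for the 184-row image. A faster GPU only shrinks the per-row part, so the gap between the two boards narrows as the image gets smaller.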
In your case, you can also check the system status with ‘tegrastats’. The system may not be running at its maximum clocks.
I am running the Gaussian filter engine and threshold (both via OpenCV with GPU acceleration) and the function findContours (not on the GPU). I have edited the rc.local file with https://github.com/yongxu/tx1-max-perf-script/blob/master/max_perf_script.sh
and a corresponding .sh file for the TK1.
The result of running tegrastats with TX1:
RAM 1461/3853MB (lfb 363x4MB) cpu [44%,35%,43%,27%]@1734 EMC 25%@1600 AVP 0%@80 GR3D 30%@998 EDP limit 1734
I am using the following frequencies: CPU 1734 MHz, GPU 998.4 MHz
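For comparing several runs, the tegrastats line above can be picked apart with a few regular expressions. This is only a sketch matched against the exact line quoted here; the tegrastats format differs between L4T releases, so the patterns are assumptions and may need adjusting. The function name parse_tegrastats and the dictionary keys are made up for this example.

```python
import re

SAMPLE = ("RAM 1461/3853MB (lfb 363x4MB) cpu [44%,35%,43%,27%]@1734 "
          "EMC 25%@1600 AVP 0%@80 GR3D 30%@998 EDP limit 1734")


def parse_tegrastats(line):
    """Extract load and clock figures from one tegrastats line.

    Only the fields present in the sample line are handled.
    """
    stats = {}

    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    if ram:
        stats["ram_used_mb"] = int(ram.group(1))
        stats["ram_total_mb"] = int(ram.group(2))

    cpu = re.search(r"cpu \[([^\]]*)\]@(\d+)", line)
    if cpu:
        # Per-core load in percent; non-numeric entries are skipped.
        stats["cpu_load_pct"] = [int(p) for p in re.findall(r"(\d+)%", cpu.group(1))]
        stats["cpu_mhz"] = int(cpu.group(2))

    # EMC is the memory controller, GR3D is the GPU.
    for key, label in (("emc", "EMC"), ("gpu", "GR3D")):
        match = re.search(label + r" (\d+)%@(\d+)", line)
        if match:
            stats[key + "_load_pct"] = int(match.group(1))
            stats[key + "_mhz"] = int(match.group(2))

    return stats


if __name__ == "__main__":
    for name, value in parse_tegrastats(SAMPLE).items():
        print(name, value)
```

For the line above this gives a GPU load of 30% at 998 MHz, so the clocks are at the configured maximum, but the GPU is idle most of the time during the sample.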
When we benchmarked TK1 against TX1, we saw degradation in some cases and improvement in others, caused by synchronization deep in the software architecture.
To confirm whether this is the problem, you can check the execution logs from both boards and compare the time spent on the GPU. The TX1 should be faster; if it is not, the slowdown is caused by GPU architecture changes, and the code is probably sub-optimal for the new architecture. If GPU execution on the TX1 is faster, you can improve the pipeline by making better use of streaming and synchronization.
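If no profiler log is at hand, a per-stage wall-clock measurement gives a first indication of where the time goes. The sketch below is generic Python, not OpenCV4Tegra code: time_stages is a made-up helper, and the two stage functions are pure-Python stand-ins for the real GPU (Gaussian filter plus threshold) and CPU (findContours) steps. With real GPU calls, each stage has to finish, or be synchronized, before the clock is stopped; otherwise the GPU time is attributed to the following stage.

```python
import time


def time_stages(stages, frame, runs=20):
    """Return the average wall-clock time in ms for each (name, function) stage.

    Each stage receives the output of the previous one, as in a pipeline.
    """
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(runs):
        data = frame
        for name, fn in stages:
            start = time.perf_counter()
            data = fn(data)
            totals[name] += (time.perf_counter() - start) * 1000.0
    return {name: total / runs for name, total in totals.items()}


# Hypothetical stand-ins for the real processing steps.
def blur_and_threshold(rows):
    return [[255 if value > 127 else 0 for value in row] for row in rows]


def count_foreground(rows):
    return sum(value == 255 for row in rows for value in row)


if __name__ == "__main__":
    # Small synthetic image so the example runs quickly.
    frame = [[(x * y) % 256 for x in range(192)] for y in range(120)]
    report = time_stages(
        [("gpu part (stand-in)", blur_and_threshold),
         ("cpu part (stand-in)", count_foreground)],
        frame,
    )
    for name, ms in report.items():
        print(f"{name}: {ms:.3f} ms")
```

Running the same measurement on both boards and comparing the GPU stage separately from the CPU stage shows whether the GPU part is actually faster on the TX1 or whether the time is lost elsewhere in the pipeline.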
Hope this helps in your case.