Jetson TK1 CUDA performance in multithreaded app

Hi, folks!

I have a custom-made CUDA library, and I benchmark it in maximum CPU and GPU performance mode (see Jetson/Performance - eLinux.org). The benchmark is a simple performance test: call a function 100 times and take the median time (it is actually OpenCV’s perf test). The numbers I get are reasonable. I also have a multithreaded application that processes frames from a video file with this library and displays the result. However, the processing time in the application is at least twice as long as in the performance test. This only happens in maximum performance mode; with default settings, the times in the benchmark and in the application are the same. I have profiled GPU and memory load, and it is about 10%, so it is not a resource issue (there is also no such problem on a PC). Could you please help me with this “maximum performance mode” problem?
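
To make the measurement concrete, here is a minimal sketch of the "call 100 times and take the median" timing in plain C++. It is not OpenCV's perf test code; the names `median` and `median_time_ms` are made up for illustration. The idea is that the same helper could wrap the library call both in the benchmark and inside the application's worker thread, so the two numbers are measured the same way. It assumes the timed callable blocks until the GPU work has finished.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Median of the samples. Taken by value because nth_element reorders the data.
inline double median(std::vector<double> samples) {
    if (samples.empty()) throw std::invalid_argument("median: no samples");
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[mid];
    if (samples.size() % 2 == 1) return upper;
    // Even count: average the two middle elements.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}

// Hypothetical helper: runs fn `iterations` times and returns the median
// wall-clock time of one call in milliseconds.
template <typename Fn>
double median_time_ms(Fn&& fn, int iterations = 100) {
    if (iterations <= 0) throw std::invalid_argument("iterations must be positive");
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        // Assumption: fn does not return before the GPU work is done,
        // e.g. it ends with cudaDeviceSynchronize().
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return median(std::move(samples));
}
```

In the application, the callable would be the per-frame processing call of the library, timed on the thread that actually runs it, so that decoding and display time are not included in the comparison.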