I am running into performance issues on the Jetson-TK1. I have implemented a video processing filter that rotates a video stream, with one implementation for the CPU and one for the GPU using CUDA. The workload can be partitioned between the two (for example, x % of the load to the CPU and (100 - x) % to the GPU), and the partitioning is very fine-grained. The problem is that I don’t get the performance I expect, and I am beginning to wonder whether memory conflicts could be the cause. My implementation uses zero-copy memory, so the memory regions being read and written are shared between the CPU and the GPU.
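To make the setup concrete, here is a minimal sketch of a CPU/GPU split over zero-copy (mapped, pinned) buffers. It is not the actual filter: the 180-degree rotation, the row-wise split, the frame size, and the names `cpu_rows`, `rotate180_gpu`, and `rotate180_cpu` are all assumptions made for illustration. The CPU and GPU read the same source buffer and write disjoint row ranges of the same destination buffer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: rows handled by the CPU for a given CPU share in percent.
inline int cpu_rows(int height, int cpu_percent) {
    return height * cpu_percent / 100;
}

// Stand-in filter (assumption): 180-degree rotation of an 8-bit grayscale frame.
// The GPU handles rows [row_begin, h).
__global__ void rotate180_gpu(const unsigned char* src, unsigned char* dst,
                              int w, int h, int row_begin) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = row_begin + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        dst[y * w + x] = src[(h - 1 - y) * w + (w - 1 - x)];
}

// The CPU handles rows [0, row_end).
void rotate180_cpu(const unsigned char* src, unsigned char* dst,
                   int w, int h, int row_end) {
    for (int y = 0; y < row_end; ++y)
        for (int x = 0; x < w; ++x)
            dst[y * w + x] = src[(h - 1 - y) * w + (w - 1 - x)];
}

int main() {
    const int w = 640, h = 480, cpu_percent = 30;  // assumed values
    const size_t bytes = (size_t)w * h;

    // Must be set before the CUDA context is created to allow mapped host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char *src = NULL, *dst = NULL, *d_src = NULL, *d_dst = NULL;
    if (cudaHostAlloc((void**)&src, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void**)&dst, bytes, cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }
    for (size_t i = 0; i < bytes; ++i) src[i] = (unsigned char)(i % 251);

    // Device-side aliases of the same physical memory (no copies involved).
    cudaHostGetDevicePointer((void**)&d_src, src, 0);
    cudaHostGetDevicePointer((void**)&d_dst, dst, 0);

    const int split = cpu_rows(h, cpu_percent);
    if (split < h) {  // a grid with zero blocks would be an invalid launch
        dim3 block(32, 8);
        dim3 grid((w + block.x - 1) / block.x, (h - split + block.y - 1) / block.y);
        rotate180_gpu<<<grid, block>>>(d_src, d_dst, w, h, split);  // asynchronous
    }
    rotate180_cpu(src, dst, w, h, split);  // runs while the kernel is in flight
    cudaDeviceSynchronize();               // GPU writes are visible after this

    size_t mismatches = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (dst[y * w + x] != src[(h - 1 - y) * w + (w - 1 - x)]) ++mismatches;
    printf("split row: %d, mismatches: %zu\n", split, mismatches);

    cudaFreeHost(src);
    cudaFreeHost(dst);
    return 0;
}
```

In a layout like this, both processors stream through the same DRAM at the same time, so the CPU loop and the kernel compete for memory bandwidth even though their output regions never overlap.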
Is there any way to check the memory clock speed, or to set it manually?
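One partial way to check from within CUDA is to query the device attributes, sketched below. This reports the peak memory clock the runtime advertises, not the clock the board is currently running at, and it cannot change the clock; whether the value reported on the TK1 is meaningful is an assumption that would need to be verified on the board. The helper `khz_to_mhz` is a hypothetical name introduced only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the runtime reports clock rates in kHz.
inline double khz_to_mhz(int khz) {
    return khz / 1000.0;
}

int main() {
    int dev = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) {
        fprintf(stderr, "no CUDA device available\n");
        return 1;
    }

    int mem_khz = 0, core_khz = 0, bus_bits = 0;
    // Peak memory clock, peak core clock, and memory bus width of the device.
    cudaDeviceGetAttribute(&mem_khz, cudaDevAttrMemoryClockRate, dev);
    cudaDeviceGetAttribute(&core_khz, cudaDevAttrClockRate, dev);
    cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, dev);

    printf("peak memory clock: %.1f MHz\n", khz_to_mhz(mem_khz));
    printf("peak core clock:   %.1f MHz\n", khz_to_mhz(core_khz));
    printf("memory bus width:  %d bits\n", bus_bits);
    return 0;
}
```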