I have a video pipeline that collects frames with V4L2 userptr buffers backed by CUDA managed memory, runs a CUDA algorithm on each frame, and then transfers the result over ethernet.
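Simplified, the capture buffer setup looks roughly like this (error handling omitted; `FRAME_SIZE` and the function name are placeholders, not my actual code):

```cpp
#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <cuda_runtime.h>

void setup_userptr_buffers(int fd, void** buffers, int count, size_t frame_size)
{
    v4l2_requestbuffers req{};
    req.count  = count;
    req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_USERPTR;
    ioctl(fd, VIDIOC_REQBUFS, &req);

    for (int i = 0; i < count; ++i) {
        // Managed allocation: the same pointer is usable from host code,
        // from CUDA kernels, and is handed to the driver as a userptr buffer.
        cudaMallocManaged(&buffers[i], frame_size);

        v4l2_buffer buf{};
        buf.type      = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory    = V4L2_MEMORY_USERPTR;
        buf.index     = i;
        buf.m.userptr = reinterpret_cast<unsigned long>(buffers[i]);
        buf.length    = frame_size;
        ioctl(fd, VIDIOC_QBUF, &buf);
    }
}
```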
For that I use CUDA streams: I have 10 cycling buffers for incoming frames and 10 cycling buffers for outgoing frames, each pair assigned to its own stream. That way I can access one buffer directly from the host while another one is being processed on the GPU.
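The stream/buffer cycling looks roughly like this (simplified sketch; `processFrame`, the buffer arrays and `FRAME_SIZE` stand in for my actual algorithm and sizes):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

constexpr int    NUM_SLOTS  = 10;
constexpr size_t FRAME_SIZE = 1920 * 1080 * 2;   // example size, not the real one

// Stand-in for the real algorithm.
__global__ void processFrame(const unsigned char* in, unsigned char* out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

cudaStream_t   streams[NUM_SLOTS];
unsigned char* inBuf[NUM_SLOTS];    // filled from V4L2 (managed memory)
unsigned char* outBuf[NUM_SLOTS];   // drained by the ethernet sender

void initSlots()
{
    for (int i = 0; i < NUM_SLOTS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMallocManaged(&inBuf[i],  FRAME_SIZE);
        cudaMallocManaged(&outBuf[i], FRAME_SIZE);
    }
}

void submitFrame(long frameIndex)
{
    int slot = frameIndex % NUM_SLOTS;
    // Slot `slot` is processed on its own stream while the host fills or
    // drains the other slots.
    const unsigned int blocks = (FRAME_SIZE + 255) / 256;
    processFrame<<<blocks, 256, 0, streams[slot]>>>(inBuf[slot], outBuf[slot], FRAME_SIZE);
}
```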
I am doing some timing tests, and the measured processing time per frame is much longer than expected. I also noticed that if I increase the camera fps the processing takes less time, and if I decrease the fps it takes more time. Overall the processing time always stays below the frame period implied by the fps, so that's fine, but I am wondering what is happening there.
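This is roughly how I measure the per-frame time (simplified; it continues the placeholder names from the sketch above):

```cpp
#include <chrono>

double timeOneFrame(int slot)
{
    auto t0 = std::chrono::steady_clock::now();

    const unsigned int blocks = (FRAME_SIZE + 255) / 256;
    processFrame<<<blocks, 256, 0, streams[slot]>>>(inBuf[slot], outBuf[slot], FRAME_SIZE);
    cudaStreamSynchronize(streams[slot]);   // the wait I am timing against

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();  // ms
}
```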
Is the GPU somehow only working as fast as it needs to? Or am I measuring something wrong? Does cudaStreamSynchronize wait for anything other than just my kernel?
Has anyone ever seen similar behaviour?