My application needs to process images coming from a camera as fast as possible. The per-frame processing is short and fits within a frame period, so that part works fine.
However, I would like to add a long, compute-intensive task that also uses CUDA on a separate thread. This task is not required to run at the frame rate. The result is only needed every few frames.
I don’t want this long task to monopolize the GPU and interfere with the real-time processing, so I have broken it into several steps driven by a state machine. After every incoming frame I advance the state machine, which performs a small enough chunk of work before returning the GPU to the rest of the system.
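For readers who land on this thread, here is a minimal sketch of that chunked state-machine pattern. All kernel names, launch configurations, and sizes below are illustrative assumptions, not the actual code:

```cuda
// Hypothetical sketch: each frame, the real-time kernel runs first,
// then the long-running task advances by one small chunk so no single
// launch can hold the GPU for longer than a fraction of a frame period.
__global__ void processFrame(const unsigned char *frame, float *out) { /* ... */ }
__global__ void longTaskChunk(float *data, int chunkStart, int chunkSize) { /* ... */ }

struct LongTaskState {
    int cursor    = 0;        // how much of the long task is done so far
    int totalWork = 1 << 20;  // total work items (assumed)
    int chunkSize = 1 << 14;  // tuned so one chunk fits well inside a frame period
};

void onFrame(const unsigned char *frame, float *out, float *taskData,
             LongTaskState &st, cudaStream_t stream)
{
    // Real-time path: always runs, stays within the frame budget.
    processFrame<<<256, 256, 0, stream>>>(frame, out);

    // Advance the background task by exactly one chunk, then return.
    if (st.cursor < st.totalWork) {
        longTaskChunk<<<64, 256, 0, stream>>>(taskData, st.cursor, st.chunkSize);
        st.cursor += st.chunkSize;
    } else {
        // Result is complete: consume it here, then reset st.cursor
        // to restart the long task for the next cycle.
    }
}
```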
This works well for me, but I was wondering if there was a better or built-in method/paradigm to manage this kind of behaviour in CUDA. Are there synchronization or “yield” mechanisms?
CUDA stream priority may be useful. For it to help here, your long-running task will need to “cycle through” threadblocks at some reasonable rate, to give the block scheduler an opportunity to insert higher-priority blocks.
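A minimal sketch of setting that up with the runtime API (stream names are placeholders; note that numerically lower values mean higher priority):

```cuda
#include <cuda_runtime.h>

int least, greatest;
// Query the valid priority range for this device.
cudaDeviceGetStreamPriorityRange(&least, &greatest);

cudaStream_t frameStream, taskStream;
// Real-time camera processing on the highest-priority stream.
cudaStreamCreateWithPriority(&frameStream, cudaStreamNonBlocking, greatest);
// Long-running task on the lowest-priority stream: when both streams have
// blocks pending, the scheduler favors frameStream's blocks as the
// low-priority task's threadblocks retire.
cudaStreamCreateWithPriority(&taskStream, cudaStreamNonBlocking, least);
```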
In a multiprocess setting, there are also possibilities using CUDA MPS to assign resources to processes.
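For the multi-process case, a sketch of how MPS is typically enabled, with an execution-resource limit on the long-running client (the percentage value is an arbitrary example; consult the MPS documentation for your setup):

```
# Start the MPS control daemon (one per machine/GPU configuration).
nvidia-cuda-mps-control -d

# Launch the long-running process with a cap on the fraction of SM
# execution resources it may use, leaving headroom for the real-time app.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 ./long_running_task
```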
@Robert_Crovella Purely from a performance perspective, would there be any reason to assume that these alternatives would be superior to the asker’s current scheme? I suspect the answer is “no”, but have no deep insight into the trade-offs from a performance angle.
Thanks for the quick feedback. I was really just trying to calibrate my own thinking: while the built-in methods have the potential to improve flexibility and ease of use, they are unlikely to lead to improved performance.
Understood, and I would expect as much. Thank you nevertheless for the clarification. I was indeed looking to improve flexibility and ease of use - “user friendliness”, in a way.
Thanks to all for the constructive conversation! I don’t post often on this forum, but it’s nice to see an active community!