I have a video camera that I am trying to run two CUDA processing streams on. The first stream performs basic video processing tasks, like demosaicing and brightness/contrast adjustment, while the second stream performs object detection using YOLOv2 and TensorRT. Everything runs on a single GPU, scheduled from a single CPU process.
The goal is to get these streams to run asynchronously from each other. More specifically, I want the standard video processing stream to run at 60 Hz, while the object detection stream runs at 10 Hz.
The problem is that, even though the two tasks are in their own CUDA streams, the TensorRT stream starves out my video processing stream. The observed behavior is that any kernel scheduled/launched while the YOLOv2 kernels are running does not start executing until all the YOLOv2 kernels complete, which in effect throttles my video processing down to 10 Hz.
I did some research and discovered that my GPU supports stream priorities. So I modified the video processing stream to be the highest-priority stream and the TensorRT stream to be the lowest-priority stream. Unfortunately, this made no observable difference. I expected my video processing kernels to preempt the TensorRT kernels, but that is not the case.
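For reference, this is roughly how I create the two streams (variable names are mine; the TensorRT context is enqueued on `trtStream`):

```cuda
#include <cuda_runtime.h>

cudaStream_t videoStream, trtStream;

// Query the valid priority range for this device.
// Note: numerically LOWER values mean HIGHER priority in CUDA.
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

// Video processing gets the highest priority,
// TensorRT inference gets the lowest.
cudaStreamCreateWithPriority(&videoStream, cudaStreamNonBlocking,
                             greatestPriority);
cudaStreamCreateWithPriority(&trtStream, cudaStreamNonBlocking,
                             leastPriority);
```

One thing I double-checked: since lower numbers mean higher priority, I am passing `greatestPriority` (the smaller number) to the video stream, so I don't think the priorities are simply inverted.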
I looked through the TensorRT documentation for a setting that would make its kernels yield to higher-priority work, but no such option appears to exist.
Anyone have any clues as to what I might be doing wrong, or what I might try next?