Is there a way to allocate priority among different ExecutionContexts in TensorRT?


I want to assign priorities in TensorRT so that higher-priority work can preempt lower-priority work on the GPU.
As far as I know, a stream can be given a priority with cuStreamCreateWithPriority() at creation time, and stream priorities apply within a CUDA context [1].
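For reference, here is a minimal sketch of the runtime-API equivalent (cudaStreamCreateWithPriority instead of the driver-API cuStreamCreateWithPriority mentioned above); it queries the device's supported priority range and creates one high-priority and one low-priority stream:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query the range of stream priorities supported by the device.
    // Numerically lower values mean higher priority; a typical range is [-5, 0].
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    // Create one high-priority and one low-priority stream.
    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPriority);

    std::printf("priority range: least=%d, greatest=%d\n", leastPriority, greatestPriority);

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}
```

Priorities are hints: blocks from a higher-priority stream are scheduled ahead of pending blocks from lower-priority streams, but already-running blocks are not interrupted.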

This raises a question: a CUDA context (CUcontext) and a TensorRT execution context (IExecutionContext) are different things.

  1. Can preemption take place between streams with different priorities on different ExecutionContexts?
    For example, if ExecutionContext A uses high-priority stream a and ExecutionContext B uses low-priority stream b, can A preempt B?

  2. If the answer to #1 is yes, does the same hold across different processes?
    For instance, if ExecutionContext A uses high-priority stream a in application 1 and ExecutionContext B uses low-priority stream b in application 2, can A preempt B?

  3. If the answer to #2 is no, is there a way to assign different priorities to different processes?

  4. In addition, is there any relationship between CUcontext and IExecutionContext in terms of priority assignment?
    Different CUcontexts cannot run concurrently (i.e., a GPU never runs work (kernels) from two or more contexts simultaneously [1]). However, IExecutionContexts can run concurrently [2-3]. In fact, I confirmed with an experiment and Nsight Systems that different ExecutionContexts on different streams do run simultaneously, and it shows a great improvement. So the execution behaviors differ; how do they differ in terms of priority allocation?


[1] GPU sharing among different application with different CUDA context - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums
[2] Can I inference two engine simultaneous on jetson using TensorRT? - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums
[3] Multiple concurrent Execution Contexts?

Hi @urmydata,
You can use the stream that you created with a priority for inference.
TensorRT behaves no differently here than any other CUDA application. For more info about how this works, see:
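To make that concrete, the sketch below enqueues two execution contexts of the same engine on streams with different priorities. It assumes TensorRT 8.5 or later (for enqueueV3), that the engine was deserialized elsewhere, and that input/output tensor addresses are bound where indicated; the function name is illustrative.

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: run two execution contexts of one engine on streams with
// different priorities. Tensor address binding is elided.
void launchWithPriorities(nvinfer1::ICudaEngine* engine) {
    // Lower number = higher priority.
    int least = 0, greatest = 0;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, least);

    nvinfer1::IExecutionContext* ctxA = engine->createExecutionContext();
    nvinfer1::IExecutionContext* ctxB = engine->createExecutionContext();

    // ... bind input/output tensor addresses on ctxA and ctxB here ...

    // Each enqueue is a set of ordinary CUDA kernel launches on its stream,
    // so stream priorities apply exactly as for any other CUDA work.
    ctxA->enqueueV3(highPrio);
    ctxB->enqueueV3(lowPrio);

    cudaStreamSynchronize(highPrio);
    cudaStreamSynchronize(lowPrio);

    delete ctxA;
    delete ctxB;
    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
}
```

Because an IExecutionContext just issues work onto whatever cudaStream_t you pass it, the scheduling semantics (including priority hints) are those of the CUDA streams themselves, not of TensorRT.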