TRT Parallelism questions

  1. What is the default parallelism level per model instance?
    I understand that TRTIS itself supports concurrent executions of a single model using CUDA streams. But for a single model with a single instance, is each operator executed sequentially? Or is there a set number of threads (CUDA streams or otherwise)?

  2. Does TensorRT or TRTIS support model parallelism? By model parallelism I mean splitting a single model across different GPU devices (for whatever reason).
    I understand that multiple instances of the model can be instantiated across different GPUs to increase throughput.


TRTIS (TensorRT Inference Server) supports multiple GPUs but it does not support running a single inference distributed across multiple GPUs. TRTIS can run multiple models (and/or multiple instances of the same model) on multiple GPUs to increase throughput.
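For reference, the multi-instance, multi-GPU deployment described above is configured per model via the `instance_group` setting in the model's `config.pbtxt`. A minimal sketch (the count and GPU indices here are illustrative assumptions, not values from this thread):

```
# config.pbtxt fragment: run two instances of this model on each of GPUs 0 and 1.
# Each instance serves requests independently, increasing throughput; a single
# inference still executes entirely on one GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```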

Thank you for the reply.
Could you also confirm the first question?
Does TensorRT exploit any parallelism for a single execution of a single model instance?
Is it fair to assume that TensorRT will use a single CUDA stream for a single-model-instance execution?

If you are asking whether you can run inference on your model across multiple GPUs: unfortunately, that's not possible with native TensorRT.