What is the default level of parallelism per model instance?
I understand that TRTIS itself supports concurrent executions of a single model using CUDA streams. But for a single model with a single instance, is each operator executed sequentially? Or is there a set number of threads (CUDA streams or otherwise)?
Does TensorRT or TRTIS support model parallelism? By model parallelism I mean splitting a single model across different GPU devices (for whatever reason).
I understand that multiple instances of the model can be instantiated across different GPUs to increase throughput.
TRTIS (TensorRT Inference Server) supports multiple GPUs but it does not support running a single inference distributed across multiple GPUs. TRTIS can run multiple models (and/or multiple instances of the same model) on multiple GPUs to increase throughput.
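For illustration, placing multiple instances of one model on specific GPUs is done through the `instance_group` setting in the model's `config.pbtxt`. This is a sketch only; the instance count and GPU indices below are placeholders you would adapt to your deployment:

```protobuf
# Fragment of a model's config.pbtxt (values are illustrative).
# This creates `count` instances of the model on each listed GPU,
# so here: 2 instances on GPU 0 and 2 instances on GPU 1.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```

Each instance serves requests independently, which is how TRTIS increases throughput across GPUs without splitting a single inference between them.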
Thank you for the reply.
Could you also confirm the first question?
Does TensorRT exploit any parallelism within a single execution of a single model instance?
Is it fair to assume that TensorRT will use a single CUDA stream for a single model-instance execution?
If you are asking whether you can run inference on your model across multiple GPUs: unfortunately, that's not possible with native TensorRT.