Is it possible to run inference for multiple TensorRT models in parallel on a single GPU?

Say I have several small models and want to run their inferences in parallel on a single GPU (e.g. a 2080 Ti or a Jetson Xavier). Is that possible? For example, can the CUDA units be divided into several groups, one for each model?


This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.


Well, I do not think this is just a TensorRT or Jetson issue. I am wondering whether it is possible to partition the CUDA units through the CUDA API and make only some of them visible to specific processes, for any CUDA application and not only for TensorRT or Jetson apps. But it seems that is not an option.
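For the kind of per-process limit described in the post above, the closest mechanism I am aware of is the CUDA Multi-Process Service (MPS). On Volta or newer GPUs under Linux, an MPS client can be capped to a percentage of the GPU's threads through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. This is a cap rather than a hard partition, it requires the MPS control daemon to be running already, and whether it is available on a Jetson Xavier depends on the platform's CUDA release, so treat the following as a minimal sketch under those assumptions. The helper `launch_with_sm_share` and the worker script names are hypothetical.

```python
import os
import subprocess
import sys


def launch_with_sm_share(command, percent, **popen_kwargs):
    """Start `command` with an MPS cap on the share of GPU threads it may use.

    Hypothetical helper: the name and signature are made up for this sketch.
    The cap only has an effect if the process runs as a CUDA MPS client.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    env = dict(os.environ)
    # Read by the CUDA runtime of an MPS client when it creates its context.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percent)
    return subprocess.Popen(command, env=env, **popen_kwargs)


# Example (assumed worker scripts, not part of this thread):
# a = launch_with_sm_share([sys.executable, "infer_model_a.py"], 40)
# b = launch_with_sm_share([sys.executable, "infer_model_b.py"], 60)
# a.wait(); b.wait()
```

Each child process only inherits the environment variable here. The actual limiting is done by the MPS server, so without MPS the processes would simply time-share the whole GPU.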

NVIDIA Triton Inference Server will help you deploy multiple models and run them in parallel. The concurrent execution is managed internally by the server.
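As a rough illustration of what that looks like in Triton, each model in the model repository has a `config.pbtxt`, and its `instance_group` setting controls how many copies of the model may execute at once on a given GPU. The model name, batch size, and instance count below are assumptions chosen for the example, not values from this thread.

```
# config.pbtxt for one of the small models (hypothetical name)
name: "small_model_a"
platform: "tensorrt_plan"   # serialized TensorRT engine
max_batch_size: 8           # assumed value
instance_group [
  {
    count: 2                # two instances of this model may run concurrently
    kind: KIND_GPU
    gpus: [ 0 ]             # both placed on GPU 0
  }
]
```

With a similar file for each model, Triton can run the models (and multiple instances of each) concurrently on the same GPU. Note that this does not assign a fixed group of CUDA units to each model; how the work is spread across the GPU is left to the hardware scheduler.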

Thank you.