Model deployment in Triton


I am new to Triton and to model deployment in general.
I have a speech-to-text ML model saved as a PyTorch checkpoint. I also have it packaged as a Docker deployment that serves it through FastAPI.
The model takes audio files and returns the result as text or as a JSON file.
I need to scale this model on a single A100 GPU machine. The model itself is less than about 0.5 GB.

  1. Can Triton distribute the deployment of this model across multiple instances on a single GPU?
  2. If yes, which is the better solution?
    2.1. The Triton Python backend
    2.2. The Docker container with FastAPI?
  3. Is there a better way to deploy and scale the model?
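For reference, Triton can run multiple copies of a model concurrently on one GPU via the `instance_group` setting in the model's `config.pbtxt`. A minimal sketch is below; the model name, instance count, and batching values are placeholder assumptions you would tune for your own model:

```
# Hypothetical config.pbtxt for the speech-to-text model
name: "speech_to_text"
backend: "python"          # assumes the Triton Python backend wraps the PyTorch model
max_batch_size: 8

instance_group [
  {
    count: 4               # run 4 concurrent instances of the model
    kind: KIND_GPU
    gpus: [ 0 ]            # all instances on the single A100 (GPU 0)
  }
]

# Optional: batch individual requests together to improve GPU utilization
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

With a config like this, Triton schedules incoming requests across the instances, so a single GPU can serve several requests in parallel without running multiple FastAPI containers.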



We recommend that you reach out through the Triton GitHub issues page to get better help.

Thank you.