Model deployment in Triton

Description

I am new to Triton and to model deployment with it.
I have a speech-to-text ML model saved in PyTorch format. I also have it packaged as a Docker deployment that serves it through FastAPI.
The model takes audio files and returns the result as text or as a JSON file.
I need to scale this model on an A100 GPU machine. The model itself is probably less than 0.5 GB.
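
For reference, the current FastAPI wrapper is roughly along these lines (a simplified sketch; the route name, model path, and the assumption that the model maps a waveform straight to text are placeholders, not the exact service):

```python
# Rough sketch of the existing FastAPI wrapper (simplified).
# "model.pt", the /transcribe route, and the direct audio -> text
# call are placeholders for the real service.
import io

import torch
import torchaudio
from fastapi import FastAPI, UploadFile

app = FastAPI()

# Load the speech-to-text model once at startup
# (assumes a TorchScript export; a state_dict checkpoint would differ).
model = torch.jit.load("model.pt")
model.eval()

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    waveform, sample_rate = torchaudio.load(io.BytesIO(await file.read()))
    with torch.inference_mode():
        text = model(waveform)  # assumed: model maps waveform -> text
    return {"text": str(text)}
```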
Questions:

  1. Is Triton able to distribute the deployment of this model across multiple cores on a single GPU?
  2. If yes, which is the better solution?
    2.1. The Triton Python backend (see the sketch after this list)?
    2.2. The Docker deployment with FastAPI?
  3. Is there a better way to do this and scale the model?
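
For question 2.1, this is roughly what I imagine the Python backend version would look like (the tensor names "AUDIO" and "TEXT" and the model repository path are my guesses; the real config.pbtxt would need matching input/output declarations):

```python
# Sketch of a Triton Python backend model.py for this model.
# Tensor names ("AUDIO", "TEXT") and the model path are guesses;
# config.pbtxt must declare matching inputs/outputs.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # One model copy per instance; instance_group in config.pbtxt
        # controls how many instances Triton runs on the GPU.
        self.model = torch.jit.load("/models/stt/1/model.pt").cuda().eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            with torch.inference_mode():
                text = self.model(torch.from_numpy(audio).cuda())  # assumed audio -> text
            out = pb_utils.Tensor("TEXT", np.array([str(text)], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

My understanding is that setting, for example, instance_group [{ count: 4, kind: KIND_GPU }] in config.pbtxt would make Triton run four such instances concurrently on the one GPU (the "multiple cores" part of question 1), which a single FastAPI process does not do out of the box.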

Thanks!

Hi,

We recommend you reach out through the Triton GitHub issues to get better help.

Thank you.