I am new to Triton Inference Server and to deploying models with it.
I have a speech-to-text ML model saved as a PyTorch checkpoint. I also have it packaged as a Docker deployment that serves it through FastAPI.
The model takes audio files and returns the transcription as text or as a JSON file.
I need to scale this model on a single A100 GPU machine. The model itself is probably less than 0.5 GB.
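
For reference, the current service looks roughly like this (a minimal sketch, not my exact code; the endpoint name, checkpoint path, and `run_inference` stub are placeholders):

```python
# app.py: rough shape of the current FastAPI deployment.
# Paths and helper names are placeholders, not the real implementation.
import torch
from fastapi import FastAPI, UploadFile

app = FastAPI()

# Load the PyTorch checkpoint once at startup (model is under ~0.5 GB).
model = torch.load("model.pt", map_location="cuda")
model.eval()


def run_inference(model, audio_bytes: bytes) -> str:
    # Placeholder: the real code does feature extraction, a forward
    # pass on the GPU, and decoding of the transcript.
    return "transcript"


@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()
    with torch.no_grad():
        text = run_inference(model, audio_bytes)
    return {"transcript": text}
```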
- Is Triton able to distribute the deployment of this model over multiple cores of a single GPU, i.e., run several instances of the model concurrently on one A100? (A config sketch of what I mean follows this list.)
- If yes, which is the better solution:
  - the Python backend (see the `model.py` sketch below), or
  - the existing Docker container with FastAPI?
- Is there a better way to deploy and scale this model?
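
For the first question, my understanding is that Triton's documented `instance_group` setting in `config.pbtxt` can replicate a model into several concurrent instances on one GPU. Something like the following (the model name, instance count, and tensor names are made-up examples, not from my real setup):

```
name: "speech_to_text"
backend: "python"
# Batching disabled here to keep the sketch simple; dynamic batching
# is a separate knob that can also help with throughput.
max_batch_size: 0

input [
  {
    name: "AUDIO"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "TRANSCRIPT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [
  {
    # Run 4 copies of the model concurrently on GPU 0.
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

And for the Python-backend option, this is the kind of `model.py` I would expect to write (tensor names match the config above; the checkpoint path and `_decode` stub are placeholders for my actual model code):

```python
# models/speech_to_text/1/model.py: minimal Python-backend sketch.
# "AUDIO" / "TRANSCRIPT" match the config.pbtxt above; the checkpoint
# path and _decode() are placeholders for the real model.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Each instance declared in instance_group loads its own copy.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = torch.jit.load("model.pt", map_location=self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            text = self._decode(raw)
            out = pb_utils.Tensor(
                "TRANSCRIPT", np.array([text.encode("utf-8")], dtype=object)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def _decode(self, raw_audio: np.ndarray) -> str:
        # Placeholder: the real code runs feature extraction, the
        # forward pass on self.device, and transcript decoding.
        return "transcript"
```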
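
If that sketch is roughly right, I would still like to know whether running several instances this way actually beats the FastAPI container for a model of this size, or whether there is a better approach entirely.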