I am trying to deploy a NeMo Machine Translation model to Triton Inference Server.
In my understanding the pipeline looks something like this:
```
Raw Text -> Preprocessing -> Encoder -> Decoder -> Probability for next token
                                           ^                |
                                           |                v
                                      Top-k tokens <-- Beam Search
```
So the decoder is called again after each generated token, i.e. decoding is autoregressive.
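In code form, the loop I mean looks roughly like this (greedy decoding shown for simplicity, beam search has the same loop structure; `encoder_fn`/`decoder_fn` and all shapes and token ids are placeholder stand-ins, not real NeMo calls):

```python
import numpy as np

VOCAB, HIDDEN, BOS, EOS, MAX_LEN = 32000, 512, 1, 2, 64
rng = np.random.default_rng(0)

def encoder_fn(src_ids):
    # Stand-in for the exported encoder: returns encoder hidden states.
    return rng.standard_normal((1, src_ids.shape[1], HIDDEN), dtype=np.float32)

def decoder_fn(tgt_ids, enc_states):
    # Stand-in for the exported decoder: returns next-token logits, shape [1, VOCAB].
    return rng.standard_normal((1, VOCAB), dtype=np.float32)

def greedy_decode(src_ids):
    enc_states = encoder_fn(src_ids)            # encoder runs once per sentence
    tgt_ids = np.array([[BOS]], dtype=np.int64)
    for _ in range(MAX_LEN):                    # decoder runs once per output token
        logits = decoder_fn(tgt_ids, enc_states)
        next_id = int(logits.argmax(axis=-1)[0])
        tgt_ids = np.concatenate([tgt_ids, np.array([[next_id]], dtype=np.int64)], axis=1)
        if next_id == EOS:
            break
    return tgt_ids

print(greedy_decode(np.array([[5, 17, 42]], dtype=np.int64)))
```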
I am not sure how best to deploy this to Triton. I can export the encoder and decoder to ONNX separately, but I am not sure how to connect them into a pipeline (a plain Triton ensemble is a DAG and cannot express the decoding loop, as far as I can tell).
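Outside of Triton I can at least chain the two exports with onnxruntime as a sanity check; the file names and input/output tensor names below are just my assumptions, the real ones come from `session.get_inputs()`/`session.get_outputs()`:

```python
import numpy as np
import onnxruntime as ort

# File and tensor names are assumptions; inspect your own exports to get the real ones.
enc = ort.InferenceSession("encoder.onnx", providers=["CUDAExecutionProvider"])
dec = ort.InferenceSession("decoder.onnx", providers=["CUDAExecutionProvider"])

src = np.array([[5, 17, 42]], dtype=np.int64)
enc_states = enc.run(None, {"input_ids": src})[0]

tgt = np.array([[1]], dtype=np.int64)  # BOS
logits = dec.run(None, {"decoder_ids": tgt, "encoder_states": enc_states})[0]
print(logits.shape)
```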
So far I have set up a Python backend that just loads the NeMo model and performs translation end to end, but that only runs on CPU and so is not performant. My next best idea is to use the Python backend with BLS (Business Logic Scripting), which would let me call the separately deployed encoder and decoder models from one orchestration script.
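Something along these lines for the BLS `model.py` (the model names `encoder`/`decoder`, all tensor names, and the token ids are assumptions for illustration; greedy decoding shown instead of beam search to keep it short):

```python
import numpy as np
import triton_python_backend_utils as pb_utils

BOS, EOS, MAX_LEN = 1, 2, 64

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            src = pb_utils.get_input_tensor_by_name(request, "SRC_IDS").as_numpy()

            # Encoder runs once per request.
            enc_req = pb_utils.InferenceRequest(
                model_name="encoder",
                requested_output_names=["ENC_STATES"],
                inputs=[pb_utils.Tensor("INPUT_IDS", src)],
            )
            enc_resp = enc_req.exec()
            if enc_resp.has_error():
                raise pb_utils.TritonModelException(enc_resp.error().message())
            enc_states = pb_utils.get_output_tensor_by_name(enc_resp, "ENC_STATES").as_numpy()

            # Decoder loop; assumes the decoder returns last-position logits [1, VOCAB].
            tgt = np.array([[BOS]], dtype=np.int64)
            for _ in range(MAX_LEN):
                dec_req = pb_utils.InferenceRequest(
                    model_name="decoder",
                    requested_output_names=["LOGITS"],
                    inputs=[
                        pb_utils.Tensor("TGT_IDS", tgt),
                        pb_utils.Tensor("ENC_STATES", enc_states),
                    ],
                )
                dec_resp = dec_req.exec()
                if dec_resp.has_error():
                    raise pb_utils.TritonModelException(dec_resp.error().message())
                logits = pb_utils.get_output_tensor_by_name(dec_resp, "LOGITS").as_numpy()
                next_id = int(logits.argmax())
                tgt = np.concatenate([tgt, np.array([[next_id]], dtype=np.int64)], axis=1)
                if next_id == EOS:
                    break

            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("TGT_IDS", tgt)]
            ))
        return responses
```

With this layout the encoder and decoder can each run on GPU via the ONNX Runtime backend, and only the loop orchestration stays in Python.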
However, that still seems somewhat suboptimal (one decoder round trip per token), so I am looking for better suggestions on how to deploy this.
Thanks in advance