In my understanding, the pipeline looks something like this:
Raw Text -> Preprocessing -> Encoder -> Decoder -> Probability for next Token
                                           ↑              ↓
                                       Top-k Tokens <- Beam Search
So the decoder is called again and again, once per output token.
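A minimal sketch of that decode loop in plain Python, with a dummy `decode_step` standing in for the exported ONNX decoder (the vocabulary size, token IDs, and random logits here are illustrative assumptions, not the real model):

```python
import numpy as np

def decode_step(encoder_state, tokens):
    """Stand-in for one decoder call: returns next-token logits.
    A real deployment would run the exported ONNX decoder here."""
    rng = np.random.default_rng(len(tokens))  # deterministic dummy logits
    return rng.standard_normal(32)            # vocab size 32 for illustration

def greedy_decode(encoder_state, bos_id=1, eos_id=2, max_len=20):
    """Autoregressive loop: feed tokens generated so far back into the decoder."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decode_step(encoder_state, tokens)
        next_id = int(np.argmax(logits))
        tokens.append(next_id)
        if next_id == eos_id:  # stop once end-of-sequence is produced
            break
    return tokens
```

This is greedy decoding for simplicity; beam search keeps several candidate prefixes per step instead of one, but the loop structure (call decoder, pick token(s), append, repeat) is the same.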
I am not sure how best to deploy this to Triton. I can export the encoder and decoder to ONNX separately, but I'm not sure how to put them together in a pipeline.
So far I have set up a Python backend that just loads the model and performs translation end-to-end, but that runs only on CPU and so is not performant. My next best idea is to use the Python backend's BLS (Business Logic Scripting), which would allow calling the encoder and decoder from there.
However, that still seems somewhat suboptimal, so I am looking for better suggestions on how to deploy this.
If you don't have separate models for pre- and post-processing and need them to be included in the pipeline, you should be able to use the Python/BLS backends for those steps in the ensemble as well.
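For reference, a linear ensemble with Python-backend pre- and post-processing steps might look like the following `config.pbtxt` sketch (every model name and tensor name here is a hypothetical placeholder, not something Triton provides out of the box):

```
name: "translation_pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "TRANSLATION", data_type: TYPE_STRING, dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"       # Python-backend tokenizer
      model_version: -1
      input_map  { key: "RAW_TEXT",  value: "RAW_TEXT" }
      output_map { key: "INPUT_IDS", value: "input_ids" }
    },
    {
      model_name: "bls_translator"   # Python BLS model: encoder + decode loop
      model_version: -1
      input_map  { key: "INPUT_IDS",  value: "input_ids" }
      output_map { key: "OUTPUT_IDS", value: "output_ids" }
    },
    {
      model_name: "postprocess"      # Python-backend detokenizer
      model_version: -1
      input_map  { key: "OUTPUT_IDS",  value: "output_ids" }
      output_map { key: "TRANSLATION", value: "TRANSLATION" }
    }
  ]
}
```

Note the ensemble itself stays linear; the looping happens inside the BLS step, which is what the rest of this thread discusses.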
It seems the ensemble setup only allows linear pipelines; at least I have not found any examples similar to the pipeline I described above.
The issue is that I essentially have to perform beam search (post-processing) in a loop, calling the decoder again and again. Is there maybe an implementation of beam search for Triton?
Yes, you are correct that ensemble doesn't support a loop in the pipeline as described through standard models. You would want to use BLS, as you mentioned, to implement this. You could either use BLS just to implement the loop section of the ensemble as a "model", or you could use BLS to implement the whole pipeline instead of using ensembles at all.
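To make the loop concrete, here is a self-contained beam-search sketch in NumPy. The `decoder_logprobs` stub stands in for the per-step decoder call; inside a BLS `model.py` that stub would instead build a `pb_utils.InferenceRequest` against the decoder model and `exec()` it. The vocabulary size, token IDs, and dummy logits are illustrative assumptions only:

```python
import numpy as np

VOCAB, EOS = 16, 2

def decoder_logprobs(prefix):
    """Stand-in for one decoder call on a token prefix, returning log-probs.
    In a BLS model this would be an InferenceRequest to the decoder model."""
    rng = np.random.default_rng(hash(tuple(prefix)) % (2**32))  # dummy, deterministic
    logits = rng.standard_normal(VOCAB)
    return logits - np.log(np.exp(logits).sum())  # log-softmax

def beam_search(bos=1, beam=3, max_len=10):
    # Each hypothesis is (cumulative log-prob, token list).
    beams, done = [(0.0, [bos])], []
    for _ in range(max_len):
        candidates = []
        for score, toks in beams:
            lp = decoder_logprobs(toks)
            for tok in np.argsort(lp)[-beam:]:       # expand with top-k tokens
                candidates.append((score + lp[tok], toks + [int(tok)]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, toks in candidates[:beam]:        # keep the best `beam` hypotheses
            (done if toks[-1] == EOS else beams).append((score, toks))
        if not beams:                                # all surviving beams finished
            break
    best = max(done + beams, key=lambda c: c[0])
    return best[1]
```

Whether you wrap just this loop as one BLS "model" inside an ensemble, or fold pre-processing, encoder, loop, and post-processing into a single BLS model, the control flow is the same; the trade-off is mostly how much of the pipeline you want Triton to schedule versus your Python code.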