Deploying Machine Translation to Triton Inference Server

Hi,

I am trying to deploy a NeMo Machine Translation model to Triton Inference Server.

In my understanding the pipeline looks something like this:

Raw Text -> Preprocessing -> Encoder -> Decoder -> Probability for next Token
                                           ↑                     ↓
                                     Top k Tokens      <-    Beam Search

So there is a recursive call to the decoder after each token output.
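To make the loop concrete, here is a minimal sketch of that autoregressive structure in plain Python. `encode` and `decode_step` are toy stand-ins for the exported ONNX encoder/decoder (a real NeMo decoder returns a distribution over the vocabulary, not a single token); only the control flow is the point.

```python
def encode(src_tokens):
    # Toy stand-in for the ONNX encoder: runs once per input sentence
    # and returns a "memory" the decoder conditions on.
    return sum(src_tokens)

def decode_step(memory, prefix):
    # Toy stand-in for one ONNX decoder call: given the encoder memory
    # and the tokens generated so far, return the next token.
    return (memory + len(prefix)) % 10

def greedy_translate(src_tokens, bos=0, eos=9, max_len=8):
    memory = encode(src_tokens)      # encoder is called exactly once
    prefix = [bos]
    for _ in range(max_len):         # decoder is called once per output token
        nxt = decode_step(memory, prefix)
        prefix.append(nxt)
        if nxt == eos:
            break
    return prefix
```

This is the loop that has to live somewhere in the Triton deployment: the encoder runs once, then the decoder is re-invoked with a growing prefix until an end-of-sequence token appears.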

I am not sure how to best deploy this to Triton. I can export the encoder and decoder to ONNX separately, but I am not sure how to connect them into a single pipeline.

So far I have set up a Python backend that simply loads the NeMo model and performs translation end to end, but that runs only on CPU and so is not performant. My next best idea is to use the Python backend's BLS (Business Logic Scripting) feature, which would allow calling the encoder and decoder from there.

However, that still seems somewhat suboptimal, so I am looking for better suggestions on how to deploy this.

Thanks in advance

Hi Vlad,

Thanks for reaching out on the NVIDIA Developer Forums.

Based on the pipeline structure, I would recommend trying to set up a Triton ensemble model with the ONNX encoder/decoder models you separated out. There is documentation on Triton ensemble models here: server/architecture.md at main · triton-inference-server/server · GitHub.

If you don’t have separate models for pre- and post-processing and need them included in the pipeline, you should be able to use the Python/BLS backends for those steps in the ensemble as well.
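As a rough illustration of what such an ensemble config might look like, here is a sketch of a `config.pbtxt` chaining a Python-backend preprocessing model into the ONNX encoder. All model, input, and output names here are made up for the example and would need to match your actual exported models:

```
name: "nmt_preprocess_encode"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "ENCODER_STATES" data_type: TYPE_FP32 dims: [ -1, -1 ] }
]
ensemble_scheduling {
  step [
    {
      # Python backend model doing tokenization (hypothetical name)
      model_name: "preprocess"
      model_version: -1
      input_map { key: "TEXT" value: "RAW_TEXT" }
      output_map { key: "TOKEN_IDS" value: "token_ids" }
    },
    {
      # Exported ONNX encoder (hypothetical name and tensor names)
      model_name: "encoder_onnx"
      model_version: -1
      input_map { key: "input_ids" value: "token_ids" }
      output_map { key: "encoder_states" value: "ENCODER_STATES" }
    }
  ]
}
```

Note this only expresses a linear chain of steps; each step's outputs are wired to the next step's inputs via the `input_map`/`output_map` entries.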

Let me know what you think.

Hi, thanks for the reply

It seems the ensemble setup only allows linear pipelines; at least I have not found any examples similar to the pipeline I described above.

The issue is that I essentially have to perform beam search (post-processing) in a loop, calling the decoder again and again. Is there perhaps an implementation of beam search for Triton?

Hi Vlad,

Yes, you are correct that an ensemble doesn’t support a loop in the pipeline as described through standard models. You would want to use BLS, as you mentioned, to implement this. You could either use BLS just to implement the loop section of the ensemble as one “model”, or you could use BLS to implement the whole pipeline instead of using ensembles at all.
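To sketch what the BLS "loop model" would compute, here is the beam-search loop in plain, dependency-free Python. The `decode_step(memory, prefix)` callable is a stand-in for the BLS decoder call (in a real `model.py` it would construct a `pb_utils.InferenceRequest` against the ONNX decoder model and execute it); it is assumed to return `(token, log_prob)` candidates. Everything else is ordinary Python that would run unchanged inside the BLS `execute` method:

```python
def beam_search(decode_step, memory, bos, eos, beam_size=2, max_len=6):
    """Beam-search loop a BLS model would run around the decoder.

    decode_step(memory, prefix) stands in for a BLS inference request to
    the decoder model; it must return a list of (token, log_prob) pairs.
    """
    beams = [([bos], 0.0)]                 # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for prefix, score in beams:
            for tok, lp in decode_step(memory, prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beams; hypotheses ending in EOS are retired.
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                 # beams cut off at max_len
    return max(finished, key=lambda c: c[1])[0]
```

This is a simplified sketch (no length normalization, and the beam shrinks as hypotheses finish), but it shows the shape of the control flow: the only part Triton needs to provide is the per-step decoder inference, which is exactly what BLS lets you call from Python.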

I would refer to the BLS example to get started, and for further questions I would raise an issue on GitHub: Issues · triton-inference-server/server · GitHub

Hope this helps!
