Deploying Machine Translation to Triton Inference Server

Hi,

I am trying to deploy a NeMo Machine Translation model to Triton Inference Server.

In my understanding the pipeline looks something like this:

Raw Text -> Preprocessing -> Encoder -> Decoder -> Probability for next Token
                                           ↑                     ↓
                                     Top k Tokens      <-    Beam Search

So there is a recursive call to the decoder after each token output.
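To make the loop concrete, here is a minimal sketch of that autoregressive structure in plain Python. `encode` and `decode_step` are toy stand-ins for the exported ONNX encoder/decoder (a real NeMo decoder returns a distribution over the vocabulary, not a single token); only the control flow is the point.

```python
def encode(src_tokens):
    # Toy stand-in for the ONNX encoder: runs once per input sentence
    # and returns a "memory" the decoder conditions on.
    return sum(src_tokens)

def decode_step(memory, prefix):
    # Toy stand-in for one ONNX decoder call: given the encoder memory
    # and the tokens generated so far, return the next token.
    return (memory + len(prefix)) % 10

def greedy_translate(src_tokens, bos=0, eos=9, max_len=8):
    memory = encode(src_tokens)      # encoder is called exactly once
    prefix = [bos]
    for _ in range(max_len):         # decoder is called once per output token
        nxt = decode_step(memory, prefix)
        prefix.append(nxt)
        if nxt == eos:
            break
    return prefix
```

This is the loop that has to live somewhere in the Triton deployment: the encoder runs once, then the decoder is re-invoked with a growing prefix until an end-of-sequence token appears.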

I am not sure how to best deploy this to Triton. I can export the encoder and decoder to ONNX separately, but I am not sure how to connect them into a single pipeline.

So far I have set up a Python backend that simply loads the NeMo model and performs translation end to end, but that runs only on CPU and so is not performant. My next best idea is to use the Python backend's BLS (Business Logic Scripting) feature, which would allow calling the encoder and decoder from there.

However, that still seems somewhat suboptimal, so I am looking for better suggestions on how to deploy this.

Thanks in advance

Hi Vlad,

Thanks for reaching out on the NVIDIA Developer Forums.

Based on the pipeline structure, I would recommend trying to set up a Triton ensemble model with the ONNX encoder/decoder models you separated out. There is documentation on Triton ensemble models here: server/architecture.md at main · triton-inference-server/server · GitHub.

If you don’t have separate models for pre- and post-processing and need them included in the pipeline, you should be able to use the Python/BLS backends for those steps in the ensemble as well.
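As a rough illustration of what such an ensemble config might look like, here is a sketch of a `config.pbtxt` chaining a Python-backend preprocessing model into the ONNX encoder. All model, input, and output names here are made up for the example and would need to match your actual exported models:

```
name: "nmt_preprocess_encode"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "ENCODER_STATES" data_type: TYPE_FP32 dims: [ -1, -1 ] }
]
ensemble_scheduling {
  step [
    {
      # Python backend model doing tokenization (hypothetical name)
      model_name: "preprocess"
      model_version: -1
      input_map { key: "TEXT" value: "RAW_TEXT" }
      output_map { key: "TOKEN_IDS" value: "token_ids" }
    },
    {
      # Exported ONNX encoder (hypothetical name and tensor names)
      model_name: "encoder_onnx"
      model_version: -1
      input_map { key: "input_ids" value: "token_ids" }
      output_map { key: "encoder_states" value: "ENCODER_STATES" }
    }
  ]
}
```

Note this only expresses a linear chain of steps; each step's outputs are wired to the next step's inputs via the `input_map`/`output_map` entries.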

Let me know what you think.

Hi, thanks for the reply

It seems the ensemble setup only allows linear pipelines; at least I have not found any examples similar to the pipeline I described above.

The issue is that I essentially have to perform beam search (post-processing) in a loop, calling the decoder again and again. Is there perhaps an implementation of beam search for Triton?

Hi Vlad,

Yes, you are correct that an ensemble doesn’t support a loop in the pipeline as described through standard models. You would want to use BLS, as you mentioned, to implement this. You could either use BLS just to implement the loop section of the ensemble as one “model”, or you could use BLS to implement the whole pipeline instead of using ensembles at all.
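To sketch what the BLS "loop model" would compute, here is the beam-search loop in plain, dependency-free Python. The `decode_step(memory, prefix)` callable is a stand-in for the BLS decoder call (in a real `model.py` it would construct a `pb_utils.InferenceRequest` against the ONNX decoder model and execute it); it is assumed to return `(token, log_prob)` candidates. Everything else is ordinary Python that would run unchanged inside the BLS `execute` method:

```python
def beam_search(decode_step, memory, bos, eos, beam_size=2, max_len=6):
    """Beam-search loop a BLS model would run around the decoder.

    decode_step(memory, prefix) stands in for a BLS inference request to
    the decoder model; it must return a list of (token, log_prob) pairs.
    """
    beams = [([bos], 0.0)]                 # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for prefix, score in beams:
            for tok, lp in decode_step(memory, prefix):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beams; hypotheses ending in EOS are retired.
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                 # beams cut off at max_len
    return max(finished, key=lambda c: c[1])[0]
```

This is a simplified sketch (no length normalization, and the beam shrinks as hypotheses finish), but it shows the shape of the control flow: the only part Triton needs to provide is the per-step decoder inference, which is exactly what BLS lets you call from Python.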

I would refer to the BLS example to get started, and for further questions I would raise an issue on GitHub: Issues · triton-inference-server/server · GitHub

Hope this helps!
