Deploying GPT-J and T5 with FasterTransformer and Triton Inference Server

Originally published at: Deploying GPT-J and T5 with NVIDIA Triton Inference Server | NVIDIA Technical Blog

Learn step by step how to use the FasterTransformer library and Triton Inference Server to serve T5-3B and GPT-J 6B models in an optimal manner with tensor parallelism.

Hi @jwitsoe,
I am from the Chinese developer community.

There seems to be an image mismatch in the results section of the article. Figure 5 should show the T5-3B model inference speed-up comparison, but it shows GPT-J 6B instead.

@kun.he.love.u Thanks for letting us know. We’ve updated the image.

Can you please provide a similar step-by-step guide for a multi-node inference example with Triton server?

Wow, I have been waiting for a guide like this for a long time. I am also waiting for something similar for multi-node inference with Triton server, as aneesinaec asked above.

I tried a similar exercise with a BLOOM model.
I have 2 GPUs with ~10 GB of memory each. While trying to load a 14 GB model in a 2-GPU configuration, I keep getting an out-of-memory error.

The FasterTransformer backend is supposed to split the 14 GB model across the 2 GPUs and load it, right? What am I possibly missing here?

Thank you for the feedback. We will consider it. In the meantime, you can refer to the document at fastertransformer_backend/t5_guide.md at main · triton-inference-server/fastertransformer_backend · GitHub.

Can you share more details and your steps?
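
As a rough sanity check while gathering those details, the numbers from the question above (a 14 GB model, 2 GPUs, ~10 GB each) can be put into a back-of-envelope estimate. This is only a sketch: it assumes tensor parallelism splits the weights evenly across GPUs, and the function names and the `overhead_gb` values (standing in for activations, key/value cache, and runtime buffers) are hypothetical placeholders, not measured figures or part of any FasterTransformer API.

```python
def per_gpu_weight_gb(model_gb: float, num_gpus: int) -> float:
    """Weights held by each GPU, assuming an even tensor-parallel split (an idealization)."""
    return model_gb / num_gpus


def fits(model_gb: float, num_gpus: int, gpu_gb: float, overhead_gb: float) -> bool:
    """True if the per-GPU weight share plus an assumed overhead fits in GPU memory."""
    return per_gpu_weight_gb(model_gb, num_gpus) + overhead_gb <= gpu_gb


# Numbers taken from the question above.
model_gb = 14.0
gpu_gb = 10.0
num_gpus = 2

share = per_gpu_weight_gb(model_gb, num_gpus)
headroom = gpu_gb - share
print(f"weights per GPU: {share:.1f} GB, headroom: {headroom:.1f} GB")

# Hypothetical overhead values, only to show how quickly the headroom is used up.
for overhead_gb in (1.0, 2.0, 4.0):
    print(f"overhead {overhead_gb:.1f} GB -> fits: {fits(model_gb, num_gpus, gpu_gb, overhead_gb)}")
```

With an even split, each GPU holds about 7 GB of weights and has about 3 GB left for everything else, so the model only fits if the real overhead stays under that headroom. If the whole 14 GB ends up on a single GPU (for example, because the split is not actually applied), it cannot fit in ~10 GB at all.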