Originally published at: Deploying GPT-J and T5 with NVIDIA Triton Inference Server | NVIDIA Technical Blog
Learn step by step how to use the FasterTransformer library and Triton Inference Server to serve T5-3B and GPT-J 6B models in an optimal manner with tensor parallelism.
hi @jwitsoe ,
I am from the Chinese developer community.
There seems to be a picture mismatch in the results section of the article. Figure 5 should be T5-3B model inference speed-up comparison, but it shows GPT-J 6B.
@kun.he.love.u Thanks for letting us know. We’ve updated the image.
Can you please provide a similar step-by-step guide for a multi-node inference example with Triton server?
Wow, I was waiting for such a guide for a long time. Waiting for something similar with Triton server, as aneesinaec asked above.
I tried a similar exercise with a BLOOM model.
I have 2 GPUs with ~10 GB of memory each. While trying to load a 14 GB model in a 2-GPU configuration, I keep getting an out-of-memory error.
The FasterTransformer backend is supposed to split the 14 GB model across the 2 GPUs and load it, right? What am I possibly missing here?
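In case it helps: a common cause of this OOM is a mismatch between how the weights were converted and the Triton model config. The FasterTransformer backend splits the model only if the checkpoint was converted for 2 GPUs (the conversion scripts in the repo take a target GPU count) and the model's config.pbtxt requests the same tensor-parallel degree. A minimal sketch of the relevant config.pbtxt fragment, assuming the layout used in the fastertransformer_backend examples:

```
# config.pbtxt fragment for a model served via the FasterTransformer backend
backend: "fastertransformer"
parameters {
  key: "tensor_para_size"
  # Must match the GPU count the weights were converted for;
  # if the checkpoint was converted for 1 GPU, each GPU will try
  # to load the full 14 GB model and run out of memory.
  value: { string_value: "2" }
}
```

If tensor_para_size is left at 1 (or the checkpoint was converted for a single GPU), no weight sharding happens and the full model must fit on one device.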
Thank you for the feedback. We will consider it. You can also refer to the document at fastertransformer_backend/t5_guide.md at main · triton-inference-server/fastertransformer_backend · GitHub first.
Can you share more details and your steps?