Deploying GPT-J and T5 with FasterTransformer and Triton Inference Server

Originally published at: Deploying GPT-J and T5 with NVIDIA Triton Inference Server | NVIDIA Technical Blog

Learn step by step how to use the FasterTransformer library and Triton Inference Server to serve T5-3B and GPT-J 6B models in an optimal manner with tensor parallelism.

Hi @jwitsoe,
I am from the Chinese developer community.

There seems to be a picture mismatch in the results section of the article. Figure 5 should show the T5-3B model inference speed-up comparison, but it shows GPT-J 6B instead.

Thanks for letting us know. We’ve updated the image.