Deploying GPT-J and T5 with FasterTransformer and Triton Inference Server

jwitsoe · August 3, 2022, 5:00pm

Originally published at: Deploying GPT-J and T5 with NVIDIA Triton Inference Server | NVIDIA Technical Blog

Learn step by step how to use the FasterTransformer library and Triton Inference Server to serve T5-3B and GPT-J 6B models in an optimal manner with tensor parallelism.

kun.he.love.u · August 12, 2022, 12:21am

Hi @jwitsoe,
I am from the Chinese developer community.

There seems to be a picture mismatch in the results section of the article. Figure 5 should be the T5-3B model inference speed-up comparison, but it shows GPT-J 6B.

dtimonin · August 15, 2022, 5:26pm

@kun.he.love.u Thanks for letting us know. We’ve updated the image.

aneesinaec · April 3, 2023, 10:09am

Can you please provide a similar step-by-step guide for a multi-node inference example with Triton server?

jarretbembry · April 6, 2023, 5:58am

Wow, I was waiting for such a guide for a long time. Waiting for something like that with Triton server, as was asked above by aneesinaec.

aneesinaec · April 17, 2023, 10:01am

I tried a similar exercise with a BLOOM model.
I have 2 GPUs with ~10 GB of memory each. While trying to load a 14 GB model in a 2-GPU config, I keep getting an out-of-memory error.

The FasterTransformer backend is supposed to split the 14 GB model across the 2 GPUs and load it, right? What am I possibly missing here?

bhsueh · April 19, 2023, 1:11am

Thank you for the feedback. We will consider it. In the meantime, you can refer to the document fastertransformer_backend/t5_guide.md at main · triton-inference-server/fastertransformer_backend · GitHub.

bhsueh · April 19, 2023, 1:12am

Can you share more details and your steps?
