Accelerated Inference for Large Transformer Models Using FasterTransformer and Triton Inference Server

Originally published at: Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server | NVIDIA Technical Blog

Learn about FasterTransformer, one of the fastest libraries for distributed inference of transformers of any size, and the benefits of using the library.
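As context for the post, below is a minimal client sketch (not taken from the article) of how a GPT-style model served through the FasterTransformer backend can be queried with Triton's Python HTTP client. The model name `fastertransformer` and the tensor names `input_ids`, `input_lengths`, `request_output_len`, and `output_ids` are assumptions based on the backend's example configuration; check the `config.pbtxt` of your actual deployment for the exact names and shapes.

```python
# Sketch: query a Triton server hosting a GPT-style model via the
# FasterTransformer backend. Model and tensor names below are assumptions
# from the backend's example config; verify against your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder prompt token IDs; produce these with your model's tokenizer.
input_ids = np.array([[818, 257, 1402, 7404]], dtype=np.uint32)    # [batch, seq_len]
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)  # actual prompt length
request_output_len = np.array([[32]], dtype=np.uint32)             # tokens to generate

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(
    model_name="fastertransformer",
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids"))  # generated token IDs
```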

Which NVIDIA hardware resources do I need to deploy GPT-J?