Originally published at: https://developer.nvidia.com/blog/achieving-high-mixtral-8x7b-performance-with-nvidia-h100-tensor-core-gpus-and-tensorrt-llm/
As large language models (LLMs) continue to grow in size and complexity, so do the requirements for serving them quickly and cost-effectively. Delivering high LLM inference performance requires both an efficient parallel computing architecture and a flexible, highly optimized software stack. Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set new…