Originally published at: https://developer.nvidia.com/blog/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference/
Large language models (LLMs) are getting larger, increasing the amount of compute required to process inference requests. To meet real-time latency requirements for serving today’s LLMs, and to do so for as many users as possible, multi-GPU compute is a must. Low latency improves the user experience. High throughput reduces the cost of service. Both are simultaneously important.
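To make the "multi-GPU is a must" point concrete, here is a rough back-of-the-envelope sketch (not from the original post). It assumes token generation (decode) is memory-bandwidth bound, so the ideal per-token time is roughly the model's weight bytes divided by the aggregate memory bandwidth reading them. The model size (70B parameters) and per-GPU bandwidth figure are illustrative assumptions, not values from the article:

```python
# Back-of-the-envelope decode-latency estimate.
# Assumption: decode is memory-bandwidth bound, so per-token time
# ~= weight bytes read / aggregate HBM bandwidth across GPUs.
# All constants below are illustrative, not from the original post.

PARAMS = 70e9              # assumed model size: 70B parameters
BYTES_PER_PARAM = 2        # FP16/BF16 weights
GPU_BANDWIDTH = 3.35e12    # assumed per-GPU HBM bandwidth, bytes/s (H100-class)

def ideal_tokens_per_second(num_gpus: int) -> float:
    """Upper-bound decode rate with weights sharded across num_gpus
    (tensor parallel), ignoring all communication overhead."""
    weight_bytes = PARAMS * BYTES_PER_PARAM
    return (GPU_BANDWIDTH * num_gpus) / weight_bytes

for n in (1, 2, 4, 8):
    rate = ideal_tokens_per_second(n)
    print(f"{n} GPU(s): ~{rate:.0f} tokens/s per request (upper bound)")
```

Two things stand out even in this idealized model: at FP16, a 70B model's weights (~140 GB) do not fit in a single GPU's memory in the first place, and the linear scaling shown here ignores the inter-GPU communication required by tensor parallelism, which is precisely the traffic that a fast interconnect must carry.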