NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick

Originally published at: NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick | NVIDIA Technical Blog

The newest generation of the popular Llama AI models is here with Llama 4 Scout and Llama 4 Maverick. Accelerated by NVIDIA open-source software, they can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs, and are available to try as NVIDIA NIM microservices. The Llama 4 models are now natively multimodal…
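Since the post mentions the models are available as NVIDIA NIM microservices, here is a minimal sketch of calling one through the NIM OpenAI-compatible API. The base URL matches the NVIDIA API catalog convention; the exact model identifier is an assumption, so check build.nvidia.com for the id of the Llama 4 variant you want.

```python
# Minimal sketch: query a Llama 4 NIM endpoint via its OpenAI-compatible API.
# Assumptions: the NVIDIA API catalog endpoint and the model id shown below --
# verify both on build.nvidia.com before use.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA API catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # key generated at build.nvidia.com
)

completion = client.chat.completions.create(
    model="meta/llama-4-maverick-17b-128e-instruct",  # assumed model id
    messages=[{"role": "user", "content": "Summarize Llama 4's multimodal features."}],
    max_tokens=512,
    temperature=0.7,
    stream=True,  # stream so tokens print as they arrive
)

for chunk in completion:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```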

Hi, thanks for the performance data. Can you clarify some of the parameters behind these results? Specifically, what is the input prompt size, what is the output size, and what is the batch size? Can you also clarify the number of GPUs and the total memory used for this test?