Inference Performance of LLAMA-2 posted by Nvidia
According to the link above, the inference latency of LLAMA-2-13B on an A100 80GB SXM4, at batch size=1 and tp=1, is lower than the latency of LLAMA-2-7B under the same conditions.
How was this performance data obtained? It seems implausible: a 13B model performs roughly twice the computation per token of a 7B model, so it should not have lower latency under identical conditions.
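For anyone wanting to sanity-check such numbers themselves, a minimal latency-measurement sketch is below. The model call is a placeholder standing in for a real forward pass (this is not NVIDIA's benchmark harness); the point is the methodology: warmup iterations first, then report the median over repeated timed runs.

```python
import time
import statistics

def measure_latency(run_inference, warmup=5, iters=20):
    """Time repeated single-batch inference calls; return median in ms.

    run_inference: zero-argument callable performing one forward pass.
    Warmup iterations are discarded so one-time costs (kernel
    compilation, cache population) do not skew the result.
    """
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_inference()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples) * 1000  # milliseconds

# Hypothetical stand-in for e.g. a LLaMA-2 forward pass at batch size 1.
def dummy_model():
    time.sleep(0.001)

median_ms = measure_latency(dummy_model)
print(f"median latency: {median_ms:.2f} ms")
```

With a real model on GPU you would also need to synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the timer, since GPU kernels launch asynchronously; without that, the 13B-faster-than-7B anomaly could simply be a measurement artifact.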