What would it take to serve 10k inferences per second, each with more than 2000 input tokens and 50 output tokens?
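For a sense of scale, here is a rough back-of-envelope sketch of the compute that throughput implies. Every number in it is an assumption chosen for illustration: a 7B-parameter dense model, ~2 FLOPs per parameter per token, and a sustained per-GPU rate of ~400 TFLOP/s (well under datasheet peak, since decode is usually memory-bandwidth bound). Swap in your own model size and hardware figures.

```python
# Back-of-envelope compute estimate for the requested throughput.
# All constants below are illustrative assumptions, not measurements.

REQUESTS_PER_SEC = 10_000        # target inference rate from the question
INPUT_TOKENS = 2_000             # prompt (prefill) tokens per request
OUTPUT_TOKENS = 50               # generated (decode) tokens per request

MODEL_PARAMS = 7e9               # assumed dense model size (7B parameters)
FLOPS_PER_PARAM_PER_TOKEN = 2    # ~2 FLOPs per parameter per token

GPU_SUSTAINED_FLOPS = 4e14       # assumed ~400 TFLOP/s sustained per GPU,
                                 # well below peak to account for
                                 # memory-bandwidth limits during decode

tokens_per_sec = REQUESTS_PER_SEC * (INPUT_TOKENS + OUTPUT_TOKENS)
flops_per_sec = tokens_per_sec * MODEL_PARAMS * FLOPS_PER_PARAM_PER_TOKEN
gpus_needed = flops_per_sec / GPU_SUSTAINED_FLOPS

print(f"Tokens processed per second: {tokens_per_sec:,.0f}")
print(f"Compute required: {flops_per_sec / 1e15:.1f} PFLOP/s")
print(f"Rough GPU count at the assumed sustained rate: {gpus_needed:.0f}")
```

Under these assumptions the workload is dominated by prefill (2000 of the 2050 tokens per request), so batched prefill throughput and KV-cache memory footprint are likely to be the practical limits rather than decode speed.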