What would it take to serve 10k inferences per second, each with more than 2000 input tokens and 50 output tokens?
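For a sense of scale, here is a rough back-of-envelope sketch of the compute that throughput implies. Every number in it is an assumption chosen for illustration: a 7B-parameter dense model, ~2 FLOPs per parameter per token, and a sustained per-GPU rate of ~400 TFLOP/s (well under datasheet peak, since decode is usually memory-bandwidth bound). Swap in your own model size and hardware figures.

```python
# Back-of-envelope compute estimate for the requested throughput.
# All constants below are illustrative assumptions, not measurements.

REQUESTS_PER_SEC = 10_000        # target inference rate from the question
INPUT_TOKENS = 2_000             # prompt (prefill) tokens per request
OUTPUT_TOKENS = 50               # generated (decode) tokens per request

MODEL_PARAMS = 7e9               # assumed dense model size (7B parameters)
FLOPS_PER_PARAM_PER_TOKEN = 2    # ~2 FLOPs per parameter per token

GPU_SUSTAINED_FLOPS = 4e14       # assumed ~400 TFLOP/s sustained per GPU,
                                 # well below peak to account for
                                 # memory-bandwidth limits during decode

tokens_per_sec = REQUESTS_PER_SEC * (INPUT_TOKENS + OUTPUT_TOKENS)
flops_per_sec = tokens_per_sec * MODEL_PARAMS * FLOPS_PER_PARAM_PER_TOKEN
gpus_needed = flops_per_sec / GPU_SUSTAINED_FLOPS

print(f"Tokens processed per second: {tokens_per_sec:,.0f}")
print(f"Compute required: {flops_per_sec / 1e15:.1f} PFLOP/s")
print(f"Rough GPU count at the assumed sustained rate: {gpus_needed:.0f}")
```

Under these assumptions the workload is dominated by prefill (2000 of the 2050 tokens per request), so batched prefill throughput and KV-cache memory footprint are likely to be the practical limits rather than decode speed.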