Originally published at: Achieving Top Inference Performance with the NVIDIA H100 Tensor Core GPU and NVIDIA TensorRT-LLM | NVIDIA Technical Blog
Best-in-class AI performance requires an efficient parallel computing architecture, a productive tool stack, and deeply optimized algorithms. NVIDIA released the open-source NVIDIA TensorRT-LLM, which includes the latest kernel optimizations for the NVIDIA Hopper architecture at the heart of the NVIDIA H100 Tensor Core GPU. These optimizations enable models like Llama 2 70B to execute using…
Ok, but… isn’t it even more misleading to switch the measurement type for the H100 to throughput instead of latency, and then put it on the same graph as AMD’s latency measurement, with throughput on the y-axis?
At a minimum, a “fixed response time” measurement for AMD is needed, or else a separate graph without AMD (which would in effect be a challenge to actually run such a test).
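To make the suggestion concrete: a “fixed response time” comparison would report, for each system, the throughput achieved at the largest batch size that still meets a shared latency budget. A minimal sketch of that metric, using made-up placeholder numbers rather than any measured results:

```python
# Hypothetical (batch_size, end-to-end latency in seconds) samples for one
# system. These are placeholder numbers for illustration, not benchmark data.
samples = [
    (1, 0.9),
    (4, 1.4),
    (8, 2.1),
    (16, 3.6),
    (32, 6.8),
]

LATENCY_BUDGET_S = 2.5  # the shared fixed response-time constraint


def throughput_at_budget(samples, budget_s):
    """Return (requests/s, batch_size) for the largest batch size whose
    measured latency still fits within the response-time budget, or None
    if no configuration meets the budget."""
    feasible = [(batch, lat) for batch, lat in samples if lat <= budget_s]
    if not feasible:
        return None
    batch, lat = max(feasible)  # largest feasible batch size
    return batch / lat, batch


result = throughput_at_budget(samples, LATENCY_BUDGET_S)
if result:
    tput, batch = result
    print(f"batch={batch}: {tput:.2f} req/s within a {LATENCY_BUDGET_S} s budget")
else:
    print("no configuration meets the latency budget")
```

Running the same procedure with the same budget on both systems would yield throughput numbers that are actually comparable on one axis.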