Originally published at: Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding | NVIDIA Technical Blog
Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance relative to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including…
@jwitsoe, could you please share the steps to benchmark the target engine (built following the steps in the blog)? The blog omits the steps for benchmarking the target engine, and we need them to compare and validate our results. Your response would be greatly appreciated.