Originally published at: Accelerate Generative AI Inference Performance with NVIDIA TensorRT Model Optimizer, Now Publicly Available | NVIDIA Technical Blog
In the fast-evolving landscape of generative AI, the demand for accelerated inference speed remains a pressing concern. With the exponential growth in model size and complexity, the need to swiftly produce results to serve numerous users simultaneously continues to grow. The NVIDIA platform stands at the forefront of this endeavor, delivering perpetual performance leaps through…
Outstanding work! Thanks for the effort you guys put in!
Is it really possible to achieve 99.9% accuracy with no fine-tuning for Llama-70B-Chat on the MLPerf task using 2:4 sparsity? I reproduced and tested it using TensorRT Model Optimizer (MTO) and found that it only achieves 98% accuracy in FP16.
Could you give me some suggestions for reproducing this work? For example, which hyperparameters need to be adjusted? Is FP8 fine-tuning necessary to reach 99.9% accuracy?
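For anyone following along: 2:4 structured sparsity means at most two non-zero values in every contiguous group of four weights. Below is a minimal PyTorch sketch of magnitude-based 2:4 pruning to make that concrete; it is purely illustrative and is not the calibration-based method Model Optimizer uses, and the function name is my own.

```python
import torch

def apply_2to4_magnitude_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every contiguous
    group of 4 along the flattened tensor (illustrative only)."""
    orig_shape = weight.shape
    groups = weight.reshape(-1, 4)                 # view as groups of 4
    # Indices of the 2 largest-magnitude entries per group.
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)                  # keep only the top-2
    return (groups * mask).reshape(orig_shape)

w = torch.randn(8, 8)
w_sparse = apply_2to4_magnitude_sparsity(w)
# Every group of 4 now has at most 2 non-zero entries.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=-1).max() <= 2
```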
Hi @de_hua_tang, thanks for the comment. Could you post this question, along with details of the testing you mentioned, to TensorRT Model Optimizer's GitHub Issues? It'll be easier for our engineers to discuss it with you there.
Thank you for your reply. I've raised the same question on TensorRT Model Optimizer's GitHub Issues.