Hi,
I want to use the quantized models listed in the TensorRT library. As far as I can see, NIM ships with ModelOpt, which handles quantization. The issue is that I can't find any command for quantizing the downloaded models. How can I do it?
Thanks for reaching out. This document may help answer your question.
https://docs.nvidia.com/deeplearning/tensorrt-cloud/v0.3.0-ea/user/build-trt-llm-engine.html
Best regards,
Tom
Thanks for the fast response.
This documentation only covers using TensorRT-LLM outside of the NIM container. I would like to know how I can quantize a downloaded model located in /opt/nim/.cache inside the NIM container. The official documentation for NIM does not provide any information about how to deploy a quantized NIM model container.
Hi @rahmanrejepov777 – we don't support quantizing the model engines downloaded by NIM. Those engines are already compiled into TensorRT-LLM's binary execution format; quantizing them would require separating out the weights and modifying the execution graph to support the quantized operations, which isn't a supported workflow. To use TensorRT ModelOpt you should start from a supported checkpoint format, such as Hugging Face or .nemo. From there, you would need to find some other way of deploying the quantized model, as NIM does not yet support custom compiled engines.
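For reference, here is a minimal sketch of what the checkpoint-based flow can look like. It assumes the ModelOpt Python API (modelopt.torch.quantization and modelopt.torch.export), a placeholder Hugging Face Llama checkpoint, and a toy calibration loop; the exact config names and export arguments may differ between ModelOpt releases, so treat this as an outline rather than a recipe.

```python
# Hypothetical sketch: quantizing a Hugging Face checkpoint with TensorRT ModelOpt
# and exporting a TensorRT-LLM checkpoint. Model name, calibration data, and config
# choice are placeholders; verify API names against your installed ModelOpt version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Run a small calibration set through the model so ModelOpt can collect
    # activation statistics; replace with a representative dataset.
    prompts = ["Hello, world!", "The quick brown fox jumps over the lazy dog."]
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize to FP8 (other configs, e.g. INT4 AWQ, are also provided by ModelOpt).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)

# Export a TensorRT-LLM checkpoint that trtllm-build can compile into an engine.
export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype=torch.float16,
    export_dir="quantized_ckpt",
)
```

The resulting engine would then have to be served outside of NIM (for example with Triton and the TensorRT-LLM backend), since NIM does not accept custom engines.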
I see now.
Can I manually insert my own TensorRT-LLM engine into the NIM container by quantizing the model weights downloaded from Hugging Face?
By "manually", I mean extracting the NIM container's filesystem, inserting my own TensorRT-LLM engine into it, and converting it back into a Docker container.
@rahmanrejepov777 this isn't a supported workflow. You would need to compile the model with the same version of TRT-LLM as the original NIM container and match the container's file layout and contents. Because it isn't supported, there's no guarantee that the same process will work from model to model or release to release.
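For anyone experimenting with this anyway, a quick way to see which TensorRT-LLM version a given container ships, assuming tensorrt_llm is importable as a Python package inside it, is a minimal check like:

```python
# Run inside the container to print the installed TensorRT-LLM version,
# so an externally built engine can target the same release.
import tensorrt_llm

print(tensorrt_llm.__version__)
```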