NIM model quantization

Hi,
I want to be able to use the quantized models listed in the TensorRT library. As far as I can see, NIM includes ModelOpt, which handles quantization. The issue is that I can't find any command for quantizing the downloaded models. How can I do this?

Hi @rahmanrejepov777,

Thanks for reaching out. This document may help answer your question.

https://docs.nvidia.com/deeplearning/tensorrt-cloud/v0.3.0-ea/user/build-trt-llm-engine.html

Best regards,
Tom

Thanks for the fast response.

This documentation only covers using TensorRT-LLM outside of the NIM container. I would like to know how I can quantize a downloaded model located in /opt/nim/.cache inside the NIM container. The official NIM documentation does not provide any information on how to deploy a quantized NIM model container.


Hi @rahmanrejepov777 – we don’t support quantizing the model engines downloaded by NIM. These engines are already compiled into TensorRT-LLM’s binary execution format. Quantizing them would require separating out the weights and modifying the execution graph to support the quantized operations, which isn’t a supported workflow. To use TRT ModelOpt, you should start from a supported checkpoint format, such as Hugging Face or .nemo. From there, you would need to find some other way of deploying the quantized model, as NIM does not yet support custom compiled engines.
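
For reference, here is a rough sketch of what that ModelOpt path could look like when starting from a Hugging Face checkpoint. The model ID, calibration prompts, FP8 config, and export helper below are only illustrative, and ModelOpt API names and arguments can change between releases, so please check the Model Optimizer documentation for your version.

```python
# Sketch: post-training quantization of a Hugging Face checkpoint with
# TensorRT Model Optimizer (ModelOpt). Model ID, calibration prompts, the FP8
# config and the export helper are illustrative assumptions; verify the exact
# API against the ModelOpt docs for your release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint, substitute your own

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).cuda()

# A handful of prompts stand in for a real calibration dataset.
calib_prompts = ["The quick brown fox jumps over the lazy dog."] * 8

def forward_loop(m):
    # ModelOpt calls this to run calibration forward passes over the model.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        m(**inputs)

# FP8 post-training quantization; ModelOpt also ships INT8/INT4 configs.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a TensorRT-LLM checkpoint that `trtllm-build` can compile into an engine.
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type="llama",
        dtype=torch.float16,
        export_dir="/tmp/llama3-fp8-checkpoint",
        inference_tensor_parallel=1,
    )
```

The resulting engine would still need to be served outside of NIM, per the note above about custom compiled engines.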


I see now.

Can I manually insert my own TensorRT-LLM engine into the NIM container, built by quantizing model weights downloaded from Hugging Face?

By “manually”, I mean extracting the NIM container’s filesystem, inserting my own TensorRT-LLM engine into it, and converting it back into a Docker container.

@rahmanrejepov777 this isn’t a supported workflow. You would need to compile the model with the same version of TRT-LLM as the original NIM container and match the file layout/contents of the NIM container. Since it isn’t supported, there’s no guarantee that the same process will work from model to model or from release to release.
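
If it helps to see the version-matching constraint concretely, one way to check which TRT-LLM version a given NIM container ships is to import the tensorrt_llm package inside the container (e.g. via docker exec). A minimal sketch, assuming the standard tensorrt_llm Python package is installed in the container:

```python
# Minimal check of the TensorRT-LLM version bundled in a container image.
# Run inside the NIM container, e.g. `docker exec -it <container> python3 check_version.py`.
import tensorrt_llm

print(tensorrt_llm.__version__)
```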
