Is it possible to integrate my own model optimization technique into TensorRT?


TensorRT is an algorithm optimizer. I would like to know if I can use my own optimization techniques with it. For example, let’s say I design an algorithm to further compress AI models and I want to use it on the TX2. How can I make TensorRT use it? Do I have to implement it using CUDA or PyCUDA?

Thank you


Would you mind sharing more details about the optimization?

If it is applied to the model architecture, you can do it directly and pass the pruned or compressed model to TensorRT.
If it is an implementation optimization, you can try to write your own code as a plugin (C++):


Thank you very much for your answer.
I would like to implement the algorithm that is being discussed here:

Some more questions:
Are those plugins related to the graphsurgeon API?
Is it possible to see all the optimisations that are being applied by TensorRT?

Thank you


There are three stages mentioned in the paper: pruning, trained quantization and Huffman coding.
Based on the description, the optimization modifies the network itself rather than the inference implementation.

So ideally, you should follow the paper to get a pruned and quantized model first.
Then compress that model with Huffman coding to get the final output described in the paper.
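
To make the first two stages concrete, here is a minimal sketch in plain Python. It is not the paper's exact procedure: the paper trains its shared weights, while this example uses a fixed, evenly spaced codebook just to stay short. The function names, the threshold and the number of levels are all assumptions chosen for illustration.

```python
# Sketch only: magnitude pruning followed by a shared-weight codebook.
# All names and values here are illustrative assumptions.

def prune(weights, threshold):
    """Zero every weight whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def build_codebook(weights, n_levels):
    """Return [0.0] plus n_levels evenly spaced centroids over the nonzero weights.

    Index 0 is reserved for pruned (zero) weights.
    """
    nonzero = [w for w in weights if w != 0.0]
    if not nonzero:
        return [0.0]
    lo, hi = min(nonzero), max(nonzero)
    if n_levels < 2 or hi == lo:
        return [0.0, lo]
    step = (hi - lo) / (n_levels - 1)
    return [0.0] + [lo + i * step for i in range(n_levels)]

def quantize(weights, codebook):
    """Replace each weight by the index of its nearest centroid (0 for pruned)."""
    indices = []
    for w in weights:
        if w == 0.0:
            indices.append(0)
        else:
            nearest = min(range(1, len(codebook)),
                          key=lambda i: abs(codebook[i] - w))
            indices.append(nearest)
    return indices

def dequantize(indices, codebook):
    """Rebuild dense weights from the indices and the codebook."""
    return [codebook[i] for i in indices]

weights = [0.02, -0.9, 0.45, -0.01, 0.88, -0.47, 0.0, 0.51]
pruned = prune(weights, threshold=0.1)
codebook = build_codebook(pruned, n_levels=4)
indices = quantize(pruned, codebook)
restored = dequantize(indices, codebook)
```

The `restored` list is what a framework would load back as layer weights. The small integer `indices` plus the short `codebook` are what actually get stored.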

At inference time, decode the Huffman-coded model first and then feed it directly to TensorRT.
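
The Huffman step is lossless, so decoding gives back exactly the quantized indices you stored. Here is a rough sketch of the round trip using only the Python standard library. The helper names are made up for this example, and bits are kept as a text string for readability, whereas a real tool would pack them into bytes.

```python
import heapq
from collections import Counter

def build_huffman_codes(symbols):
    """Return {symbol: bitstring} built from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: partial code}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing one bit to each.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(symbols, codes):
    """Concatenate the code of every symbol into one bitstring."""
    return "".join(codes[s] for s in symbols)

def decode(bits, codes):
    """Invert encode(); valid because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

# Quantized weight indices; 0 (pruned) dominates, so it gets the shortest code.
indices = [0, 1, 3, 0, 4, 2, 0, 3, 0, 0]
codes = build_huffman_codes(indices)
bits = encode(indices, codes)
assert decode(bits, codes) == indices
```

After decoding, you map the indices back to weights through the codebook, rebuild the model in your training framework, and export it in whatever format your TensorRT version's parser accepts. TensorRT never sees the Huffman-coded form.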

The optimization is independent of the TensorRT implementation.
You don’t need to integrate it into TensorRT.