TensorRT optimization for pruning


When converting a pruned Caffe model to a TensorRT engine, are the pruned weights removed or excluded from the computation?

As far as I can tell, TensorRT does not automatically remove pruned weights.

With a heavily pruned TF model (the frozen graph deflates by 80% when zipped), I see no increase in inference speed after converting it to a TensorRT engine with the Python API.
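A quick way to see why the zipped graph can shrink dramatically while the math stays just as expensive: magnitude-style pruning only zeroes entries, it does not change any tensor shape. A minimal numpy sketch (the quantile-threshold scheme here is illustrative, not the exact algorithm any particular tool uses):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

def prune_by_magnitude(weights, sparsity=0.8):
    """Zero the smallest-magnitude entries until `sparsity` fraction are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights).astype(np.float32)

pruned = prune_by_magnitude(w)

# The tensor shape (and hence the dense matmul cost) is unchanged...
assert pruned.shape == w.shape
# ...but the zeros compress extremely well, which is why the zipped
# frozen graph deflates so much even though inference is no faster.
print(len(zlib.compress(w.tobytes())), len(zlib.compress(pruned.tobytes())))
```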

But I REALLY hope I’m doing something wrong and that TensorRT is capable of this.

Bump, would be nice to have an “official” answer.

Is there an existing feature for this? Or planned?

Could you please elaborate more on what form of pruning you are using in this case?

Also, if possible could you please share the model and sample script to reproduce the issue so we can help better.



I used https://www.tensorflow.org/model_optimization/guide/pruning which basically makes the model sparse, but if I understand it correctly, TensorRT does not have any optimizations for sparse matrix multiplications.

Or do you mean that you expect an increase in inference speed for sparse models?

I’ve seen many references to pruned models used with TensorRT, but no mention of which pruning techniques were used or whether TensorRT played any part in them.

If you channel-prune models in the right way (and then compress them), you will get an increase in speed in TensorRT.
But you should contact the people who created those models for more information on how they were pruned, since it wasn’t anything done in TensorRT.
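For contrast, here is roughly what channel pruning does, sketched with a plain numpy stand-in for a fully-connected layer. The L1-norm selection criterion is just one common illustrative choice, not tied to any particular tool:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)  # 512 output channels

# Rank output channels by L1 norm and keep the strongest half.
scores = np.abs(w).sum(axis=1)
keep = np.sort(np.argsort(scores)[256:])

# The pruned layer is physically smaller: half the rows are gone, so the
# dense multiply that actually runs on the GPU really does half the work.
w_pruned = w[keep]                         # shape (256, 512)
x = rng.standard_normal(512).astype(np.float32)
y = w_pruned @ x                           # 256 outputs; downstream layers must be re-wired
```

This is why structured pruning can speed up a dense engine while unstructured zeroing cannot: the shapes themselves shrink.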

GPUs are simply very good at dense math, so unless sparsity is appropriately structured or weights are very sparse, sparse computations are unlikely to improve performance.
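As an illustration of what "appropriately structured" can mean: sparse tensor cores on newer NVIDIA GPUs accelerate a 2:4 pattern (at most two non-zeros in every group of four consecutive weights). A small checker, sketched here to show why unstructured magnitude pruning almost never lands in that pattern by accident (the row-major flattening below is an approximation; exact grouping conventions vary):

```python
import numpy as np

def satisfies_2_4(weights):
    """True if every group of 4 consecutive values has at most 2 non-zeros."""
    flat = np.asarray(weights).reshape(-1, 4)
    return bool(np.all(np.count_nonzero(flat, axis=1) <= 2))

# A hand-built 2:4 pattern passes the check...
ok = np.array([[1.0, 0.0, 2.0, 0.0],
               [0.0, 3.0, 0.0, 4.0]])

# ...but unstructured 80% magnitude pruning almost never does: with entries
# zeroed purely by magnitude, some group of four keeps 3 or 4 survivors.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w[np.abs(w) < np.quantile(np.abs(w), 0.8)] = 0.0

print(satisfies_2_4(ok), satisfies_2_4(w))
```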