The goal: faster inference time, running on a TX2.
The flow:
I have a Keras model which I trained and converted to TensorRT using the “trt.create_inference_graph” function. Inference time is ~27[msec].
After pruning the model and converting it to TensorRT, inference time remains the same.
Pruning was done using the “prune_low_magnitude” function with a “PolynomialDecay” schedule, setting the final sparsity to ~90% (see the sketch after this list).
The pruned model is significantly smaller when compressed (~3 times smaller), which indicates a large number of zero weights, as expected.
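For reference, this is roughly what the pruning setup looked like. A minimal sketch, assuming TF 1.x-style Keras plus the TensorFlow Model Optimization Toolkit (tensorflow_model_optimization); the toy model, step counts, and file name are placeholders rather than my real values:

```python
# Minimal pruning sketch (placeholder model and numbers, not the real network).
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model standing in for the real Keras network.
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.90,   # ~90% of the weights driven to zero
        begin_step=0,
        end_step=10000,
    )
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(base_model, **pruning_params)
pruned_model.compile(optimizer="adam", loss="categorical_crossentropy")
# Training needs the UpdatePruningStep callback to advance the schedule:
# pruned_model.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Before export/conversion, strip the pruning wrappers.
export_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
export_model.save("pruned_model.h5")  # placeholder file name
```

Note that strip_pruning only removes the pruning wrappers and bakes the zeros into the ordinary dense weight tensors; the layer shapes stay the same.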
The questions:
Does inference time depend only on the number of nodes and engines of a TensorRT model?
i.e. if I get the same numbers after pruning as without it, will inference time necessarily be the same?
Is there another step after pruning a Keras model, prior to converting to TensorRT, that enables faster inference?
i.e. pruning just gives me a different set of weights, many of them 0, which is good for getting a small compressed file, but is there a way to “remove” them from the flow, or ignore them somehow, to get faster inference?
The pruned TensorRT (.pb) file I got was only a little smaller than the original. Is that an indication of something, or just a “fluke”?
I have pruned to various levels (30%, 70%…), but the TensorRT model stays roughly the same size: smaller than the original, but not decreasing with the pruning percentage.
Sorry for the late update.
This post was somehow missed in our process.
1.
In general, TensorRT inference time depends on how many kernels are launched and how long each one takes.
If most of the weights are pruned to 0, that should lead to much less kernel work and faster inference.
Is it possible that you are still using the old model?
Please note that you will need to recreate the TensorRT engine if you have serialized it.
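In case it helps, here is a rough sketch of rebuilding the TF-TRT graph from a freshly frozen copy of the pruned model (assuming the TF 1.x tf.contrib.tensorrt API; the file and output node names are placeholders):

```python
# Rebuild the TF-TRT graph from the *pruned* frozen GraphDef (TF 1.x contrib API).
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

with tf.gfile.GFile("pruned_frozen.pb", "rb") as f:   # placeholder path
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["detection_output"],       # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16")              # FP16 is usually the biggest win on TX2

with tf.gfile.GFile("pruned_trt.pb", "wb") as f:       # placeholder path
    f.write(trt_graph.SerializeToString())
```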
2. You can try to remove all the unnecessary nodes, e.g. the auxiliary nodes used only for training.
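For example, something along these lines with the TF 1.x graph_util helpers (a sketch only; the checkpoint path and output node name are placeholders):

```python
# Freeze the graph and drop training-only nodes before TF-TRT conversion (TF 1.x).
import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")   # placeholder checkpoint
    saver.restore(sess, "model.ckpt")
    # Fold variables into constants, then remove nodes only needed for training
    # (optimizer state, gradient ops, etc.).
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ["detection_output"])  # placeholder output node
    cleaned = graph_util.remove_training_nodes(frozen)

with tf.gfile.GFile("pruned_frozen.pb", "wb") as f:              # placeholder path
    f.write(cleaned.SerializeToString())
```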
3. To give a more precise suggestion, would you mind telling us which pruning tool you are using?
Is it from Keras?
We also have a toolkit which can “slim” the model, but it works differently from weight pruning.
The TLT toolkit is intended to remove unnecessary layers and then fine-tune the model again.
Another thing worth checking is how many layers are actually inferenced with TensorRT.
Please note that the framework you used is TF-TRT.
TF-TRT integrates TensorRT into the TensorFlow interface, so each layer may be inferenced with either TensorFlow or TensorRT.
The layer deployment can be found in the TensorFlow log.
Could you collect it and share it with us?
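One quick way to see the split, as a sketch (assuming the converted TF-TRT GraphDef was saved as “pruned_trt.pb”, a placeholder name), is to count the TRTEngineOp nodes in the graph:

```python
# Count how many TensorRT engine ops were created vs. remaining plain TF ops.
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("pruned_trt.pb", "rb") as f:   # placeholder path
    graph_def.ParseFromString(f.read())

trt_engines = [n.name for n in graph_def.node if n.op == "TRTEngineOp"]
print("TRTEngineOp nodes:", len(trt_engines))
print("Remaining TF nodes:", sum(1 for n in graph_def.node if n.op != "TRTEngineOp"))
```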
It took a while, but in continuation of this topic, the situation is now this:
Using TLT, there is an object detection engine (based on an SSD MobileNet v2).
Using TLT, the model was pruned (to ~half the original size).
Even though the model has been pruned, inference time remains ~20-25[msec] on a TX2.
Is that the expected behaviour?
Should I prune to a much smaller size?
Is there anything I can verify (aside from file size and accuracy), in logs or anywhere else, that pruning worked as expected, given the end goal of a faster inference time?
The pruning was done via TLT, and naturally I compared the original model with the pruned one.
What level of pruning should I aim for? i.e. do I get a visible speed-up at 30% (keeping 30% of the original model), or do I need to go much further, to 10% or similar?
This depends on your use case.
For example, some users expect 30 fps for their use case.
So they stop pruning once the performance meets their requirement.
This is a trade-off between accuracy and performance.