**The goal:** get faster inference time when running on a TX2.

**The flow:**

- I have a Keras model that I trained and converted to TensorRT using the function “trt.create_inference_graph”. Inference time is ~27 ms.
- After pruning the model and converting it to TensorRT, the inference time remains the same.
- Pruning was done using the “prune_low_magnitude” function, setting the final sparsity to ~90% in “PolynomialDecay”.
- The pruned model is significantly smaller when compressed (~3 times smaller), which indicates a large number of zeros, as expected.
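To make the last observation concrete, here is a minimal sketch using only the Python standard library. It does not touch my actual model: the weights are synthetic Gaussian values, and `magnitude_prune` and `sizes` are hypothetical helper names made up for this illustration. It zeroes the 90% smallest-magnitude values and compares the raw and zlib-compressed byte counts.

```python
import random
import struct
import zlib


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sizes(weights):
    """Return (raw_bytes, zlib_compressed_bytes) for float32-packed weights."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw))


# Synthetic stand-in for a weight tensor (assumption: not the real model).
random.seed(0)
dense = [random.gauss(0.0, 1.0) for _ in range(50_000)]
pruned = magnitude_prune(dense, sparsity=0.9)

dense_raw, dense_zip = sizes(dense)
pruned_raw, pruned_zip = sizes(pruned)
zero_fraction = pruned.count(0.0) / len(pruned)

print(f"zero fraction: {zero_fraction:.2f}")
print(f"raw bytes:     dense={dense_raw} pruned={pruned_raw}")
print(f"zlib bytes:    dense={dense_zip} pruned={pruned_zip}")
```

In this toy setup the raw byte count is identical before and after pruning, because the zeros are still stored as full float32 values; only the compressed size drops.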

**The questions:**

- Does inference time depend only on the number of nodes and engines in a TensorRT model?

  I.e., if I get the same number of nodes and engines after pruning as without pruning, will the inference time necessarily be the same?

- Is there another step after pruning a Keras model, prior to converting it to TensorRT, that enables faster inference?

  I.e., pruning just gives me a different set of weights, many of which are 0. That is good for getting a small compressed file, but is there a way to “remove” them from the flow, or ignore them somehow, to get faster inference?

- The pruned TensorRT (.pb) file I got was only slightly smaller than the original. Is that an indication of something or just a “fluke”?

  I have pruned to various levels (30%, 70%…), but the TensorRT model stays roughly the same size: smaller than the original, but not shrinking as the pruning percentage increases.

Thanks for the help