Should pruning a model prior to converting it to TensorRT make inference faster?

The goal: faster inference time, running on a TX2.

The flow:

  • I have a Keras model which I trained and converted to TensorRT using the function “trt.create_inference_graph”. Inference time is ~27 ms.
  • After pruning the model and converting it to TensorRT, the inference time remains the same.
  • Pruning was done with the “prune_low_magnitude” function, setting the final sparsity to ~90% in “PolynomialDecay”.
  • The pruned model is significantly smaller when compressed (~3 times smaller), which indicates that there is a large number of 0s, as expected.
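
The last bullet can be illustrated with a small sketch. It is not the Keras pruning code: it zeroes the smallest 90% of a list of stand-in random weights (the helper names `prune_smallest` and `packed_sizes` are made up for this example) and compares the raw and zlib-compressed sizes. The raw size does not change, because every zero is still stored as a full float32; only the compressed size drops.

```python
import random
import struct
import zlib


def packed_sizes(weights):
    """Return (raw_bytes, compressed_bytes) for weights stored as float32."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw))


def prune_smallest(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    n_zero = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned


random.seed(0)
dense = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in weights
sparse = prune_smallest(dense, 0.9)                       # 90% zeros

dense_raw, dense_zip = packed_sizes(dense)
sparse_raw, sparse_zip = packed_sizes(sparse)
print("raw bytes:       ", dense_raw, "->", sparse_raw)
print("compressed bytes:", dense_zip, "->", sparse_zip)
```

This is also consistent with the uncompressed .pb file staying roughly the same size at different pruning levels.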

The questions:

  1. Does inference time depend only on the number of nodes and engines in a TensorRT model?
    i.e. if I get the same numbers after pruning as without pruning, will inference time necessarily be the same?
  2. Is there another step after pruning a Keras model, before converting to TensorRT, that enables faster inference?
    i.e. pruning just gives me a different set of weights, many of which are 0. That is good for getting a small compressed file, but is there a way to “remove” them from the flow, or ignore them somehow, to get faster inference?
  3. The pruned TensorRT (.pb) file I got was only a little smaller than the original. Is that an indication of something or just a fluke?
    I have pruned to various levels (30%, 70%…), but the TensorRT model stays roughly the same size: smaller than the original, but not shrinking as the pruning percentage increases.

Thanks for the help

Following up on the above: the presentation linked below lists various APIs for TensorRT.
https://on-demand.gputechconf.com/gtc-cn/2019/pdf/CN9456/presentation.pdf
Will using a different API with a pruned model give me a speedup?

@nvidia - anyone please?

Hi,

Sorry for the late update.
This post was somehow missed in our process.

1.
In general, TensorRT inference time depends on how many kernels are launched and how long each one takes.
If most of the weights are pruned to 0, it should lead to far fewer kernel jobs and faster inference.

Is it possible that you are still using the old model?
Please note that you will need to recreate the TensorRT engine if you have serialized it.

2. You can try removing all unnecessary nodes, e.g. the auxiliary nodes used for training.

3. To give a more precise suggestion, could you tell us which pruning tool you are using?
Is it from Keras?

We also have a toolkit that can “slim” the model, but it works differently from weight pruning.

The TLT toolkit is designed to remove unnecessary layers and then fine-tune the model again.
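
The difference between the two kinds of pruning can be sketched with a multiply-accumulate (MAC) count for a single convolution layer. This is a back-of-the-envelope sketch, not a measurement: the layer dimensions are hypothetical, and it assumes the runtime uses dense kernels that do not skip zero weights, which the thread does not confirm for TensorRT on the TX2.

```python
def conv2d_macs(out_h, out_w, kernel_h, kernel_w, in_channels, out_channels):
    """Multiply-accumulates performed by one dense 2D convolution layer."""
    return out_h * out_w * kernel_h * kernel_w * in_channels * out_channels


# Hypothetical layer: 3x3 convolution, 64 -> 128 channels, 56x56 output.
baseline = conv2d_macs(56, 56, 3, 3, 64, 128)

# Weight pruning zeroes individual weights but keeps the tensor shape,
# so a dense kernel still performs every multiplication (many by zero).
weight_pruned = conv2d_macs(56, 56, 3, 3, 64, 128)

# Structured pruning that keeps half of the input and output channels
# shrinks the tensor itself, so the work drops to a quarter.
channel_pruned = conv2d_macs(56, 56, 3, 3, 32, 64)

print("baseline MACs:      ", baseline)
print("weight-pruned MACs: ", weight_pruned)
print("channel-pruned MACs:", channel_pruned)
```

Under that assumption, zeroed weights alone would not be expected to change the inference time, while removing whole channels or layers would.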

Thanks.

thanks for the reply @AastaLLL

  • I made sure I was using the correct model when comparing times
  • for pruning I used the keras tools, my code looks like this:
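# Import path assumed from the module names used below.
from tensorflow_model_optimization.python.core.sparsity.keras import prune, pruning_schedule

# Wrap the model so that low-magnitude weights are progressively zeroed during training.
model = prune.prune_low_magnitude(
    model,
    pruning_schedule.PolynomialDecay(
        initial_sparsity=initial_sparsity,
        final_sparsity=final_sparsity,
        begin_step=begin_step,
        end_step=end_step,
        power=power,
        frequency=frequency))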

after training I do this:
model = prune.strip_pruning(model)
and the file I get is smaller (when compressed) so clearly weights have been set to 0.
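
A more direct check than the compressed file size is to count the zeros per weight tensor. The following is a minimal sketch with made-up helper names (`zero_fraction`, `sparsity_report`); it works on plain sequences, and the commented line shows one assumed way to feed it from a Keras model.

```python
def zero_fraction(values):
    """Fraction of entries in a flat sequence that are exactly zero."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0) / len(values)


def sparsity_report(named_weights):
    """Map each tensor name to its zero fraction.

    `named_weights` is an iterable of (name, flat_values) pairs.
    """
    return {name: zero_fraction(flat) for name, flat in named_weights}


# With a Keras model the pairs could come from something like
#   [(w.name, v.ravel()) for w, v in zip(model.weights, model.get_weights())]
# (assumed usage, not taken from the thread).
example = [("dense/kernel", [0.0, 0.0, 0.5, 0.0]), ("dense/bias", [0.1, 0.2])]
print(sparsity_report(example))  # {'dense/kernel': 0.75, 'dense/bias': 0.0}
```

If the kernels report a zero fraction close to the configured final sparsity, the pruning itself worked and the question becomes whether the runtime can exploit it.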

when I convert to tensorRT I:

  • freeze the model
  • run the code:
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=model_out,
    max_batch_size=1,
    max_workspace_size_bytes=max_work_space,
    precision_mode=tensorRT_precision,
    is_dynamic_op=False)  # build the engines at conversion time

I count the number of engines:
len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
and the number of nodes:
len(trt_graph.node)

Anything else I should look at?

I haven’t had the chance to try the TLT toolkit yet.

Thanks

Hi,

Another thing worth checking is how many layers are run by TensorRT.

Please note that the framework you are using is TF-TRT.
TF-TRT integrates TensorRT into the TensorFlow interface, so a layer may be run by either TensorFlow or TensorRT.

The layer deployment can be found in the TensorFlow log.
Could you collect it and share it with us?
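
Besides the log, the converted graph itself gives a rough picture: every node that is not a TRTEngineOp is left for TensorFlow to execute. The sketch below counts nodes per op type. It only assumes an object with a `.node` list whose items have an `.op` attribute, as the converted graph in this thread does; the `fake_graph` stand-in is there so the snippet runs without TensorFlow.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(graph_def):
    """Count nodes per op type in a GraphDef-like object with a `.node` list."""
    return Counter(str(n.op) for n in graph_def.node)


def summarize(graph_def):
    """Return (engine_count, nodes_outside_engines, full_histogram)."""
    ops = op_histogram(graph_def)
    engines = ops.get("TRTEngineOp", 0)
    outside = sum(count for op, count in ops.items() if op != "TRTEngineOp")
    return engines, outside, ops


# Stand-in graph for illustration; pass the converted graph instead.
fake_graph = SimpleNamespace(node=[
    SimpleNamespace(op="Placeholder"),
    SimpleNamespace(op="TRTEngineOp"),
    SimpleNamespace(op="Identity"),
])
engines, outside, ops = summarize(fake_graph)
print("TensorRT engines:", engines)
print("nodes left to TensorFlow:", outside)
print(ops.most_common())
```

Some of the remaining nodes are trivial (placeholders, identities), so the histogram is more informative than the bare count.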

Thanks.

@AastaLLL
Yes, using TF-TRT was probably the most convenient approach.

Can you please share some links or code on how to test the above -

  • is there a way to know if a layer is being inferenced with TensorFlow?
  • what specifically am I looking for from the logs?

Hi,

Sorry for the late update.
The layer placement log can be enabled with this line:

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Thanks.

Hello @AastaLLL,

It took a while, but in continuation to this topic the situation is now this:

  • Using the TLT, I now have an object detection engine (based on SSD MobileNet v2).
  • Using the TLT, the model was pruned (to ~half the original size).

Even though the model has been pruned, inference time remains ~20-25 ms on a TX2.

Is that the expected behaviour?
Should I prune to a much smaller size?
Is there anything I can verify (aside from file size and accuracy) in logs or anywhere, that pruning did work as expected with the end goal of having a faster inference time?

Thanks

Hi,

1.
Have you maximized the device performance first?
This can be done with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2.
Please use FP16 mode to get better performance on the TX2.

3.
Do you have the original, unpruned model?
If so, we recommend comparing the performance of the pruned and unpruned models first.
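
For a fair comparison, both models should be timed the same way, with a few warm-up runs excluded. A minimal timing sketch follows; `benchmark_ms` is a made-up helper and the lambda is a stand-in workload, to be replaced by whatever call runs one inference for the model under test.

```python
import statistics
import time


def benchmark_ms(run_once, warmup=10, iterations=100):
    """Time `run_once()` and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not measured
        run_once()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
    }


# Stand-in workload; replace with one inference call on a fixed input.
stats = benchmark_ms(lambda: sum(range(1000)), warmup=2, iterations=20)
print(stats)
```

Running it once per model, with the same input and the same clock settings, gives numbers that can be compared directly.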

Here are some performance results for detectnet for your reference:
The TX2’s performance should fall between the Nano and the Xavier NX.

Thanks.

@AastaLLL,
thanks for the reply -

  • I have maximized performance
  • the pruning was done via the TLT and naturally I compared original with pruned.

What level of pruning should I aim for? i.e. do I get a visible speedup at 30% (keeping 30% of the original model), or do I need to go much further, to 10% or similar?

Thanks

Hi,

Sorry for the late reply.

This depends on your use case.
For example, some users need 30 fps for their use case, so they stop pruning once the performance meets that requirement.

This is a trade-off between accuracy and performance.
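
The stopping rule described above can be written down as a small sketch. The helper names are made up for illustration; the only numbers used are the 30 fps target from this reply and the ~27 ms latency reported at the top of the thread.

```python
def latency_budget_ms(target_fps):
    """Maximum per-frame latency that still reaches `target_fps`."""
    return 1000.0 / target_fps


def fps_from_latency(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms


def first_config_meeting_target(measurements, target_fps):
    """Return the first label whose latency fits the budget, else None.

    `measurements` is an iterable of (label, latency_ms) pairs, ordered
    from least to most pruned, so the least aggressive pruning wins.
    """
    budget = latency_budget_ms(target_fps)
    for label, latency_ms in measurements:
        if latency_ms <= budget:
            return label
    return None


print(round(latency_budget_ms(30), 1))  # 33.3
print(round(fps_from_latency(27), 1))   # 37.0
```

Taking the least pruned configuration that meets the budget keeps as much accuracy as possible, which is the trade-off mentioned above.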

Thanks.