Should pruning a model prior to converting it to tensorRT make inference faster?

alex73 · May 14, 2020, 12:52am

The goal - get faster inference time, running on TX2

The flow:

I have a keras model which I have trained and converted to tensorRT, using the function - “trt.create_inference_graph”. Inference time ~27[msec].
after pruning the model and converting it to tensorRT, inference time remains the same.
pruning was done using the “prune_low_magnitude” function, setting the final sparsity to ~90% in “PolynomialDecay”
the pruned model is significantly smaller if compressed (~3 times smaller), which indicates that there is a large number of 0s, as expected.

The questions:

does inference time depend only on the number of nodes and engines of a tensorRT model?
i.e. if after pruning I get the same number as without pruning, then inference time will surely be the same?
is there another step after pruning a keras model, prior to converting to tensorRT, to enable a faster inference?
i.e. pruning gives me just a different set of weights, many of them are 0, which is good to get a small compressed file, but is there a way to “remove” them from the flow or ignore them somehow to get a faster inference?
the pruned tensorRT (.pb) file I got was just a little smaller than the original - is that an indication for something or just a “fluke”?
I have pruned to various levels (30%, 70%…) but the tensorRT model remains roughly the same in size: smaller than the original, but not shrinking as the pruning percentage increases.

Thanks for the help

alex73 · May 14, 2020, 11:34pm

Following up on the above: the link below describes several APIs for tensorRT -
https://on-demand.gputechconf.com/gtc-cn/2019/pdf/CN9456/presentation.pdf
Will using a different API with a pruned model give me a speedup?

alex73 · June 17, 2020, 4:02am

@2024a - anyone please?

AastaLLL · June 19, 2020, 2:51am

Hi,

Sorry for the late update.
This post somehow got missed in our process.

1. In general, TensorRT inference time depends on how many kernels are launched and how long each one takes.
If most of the weights are pruned to 0, it should lead to far fewer kernel jobs and faster inference.

Is it possible that you are still using the old model?
Please note that you will need to recreate the TensorRT engine if you have serialized it.

2. You can try to remove all the unnecessary nodes, e.g. the auxiliary nodes used for training.

3. To give a more precise suggestion, would you mind telling us which pruning tool you are using?
Is it from Keras?

We also have a toolkit which can “slim” the model, but it works differently from weight pruning.

The TLT toolkit removes the unnecessary layers and then fine-tunes the model again.
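
To illustrate the difference between the two approaches: zeroing weights leaves every tensor the same shape, so a dense convolution kernel still performs the same number of multiply-accumulates, while removing whole filters shrinks the tensors themselves. This may explain why magnitude pruning alone does not change the measured time. A minimal sketch in plain Python, where the layer sizes are hypothetical and chosen only for illustration:

```python
def conv2d_macs(h_out, w_out, k, c_in, c_out):
    """Multiply-accumulates for a dense k x k convolution (bias ignored)."""
    return h_out * w_out * k * k * c_in * c_out

# Hypothetical layer: 3x3 convolution, 64 -> 128 channels, 56x56 output.
dense = conv2d_macs(56, 56, 3, 64, 128)

# Magnitude pruning to 90% sparsity: many weights are 0, but the shapes are
# unchanged, so a dense kernel still does the same amount of work.
magnitude_pruned = conv2d_macs(56, 56, 3, 64, 128)

# Channel pruning that keeps half of the input and output channels:
# the tensors shrink, and so does the work.
channel_pruned = conv2d_macs(56, 56, 3, 32, 64)

print(dense, magnitude_pruned, channel_pruned)
print(channel_pruned / dense)  # 0.25
```

The count is only a rough proxy for runtime; actual speedup depends on which kernels TensorRT selects for the smaller shapes.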

Thanks.

alex73 · June 21, 2020, 11:02pm

thanks for the reply @AastaLLL

I made sure I was using the correct model when comparing times
for pruning I used the keras tools, my code looks like this:

model = prune.prune_low_magnitude(
    model,
    pruning_schedule.PolynomialDecay(
        initial_sparsity=initial_sparsity,
        final_sparsity=final_sparsity,
        begin_step=begin_step,
        end_step=end_step,
        power=power,
        frequency=frequency))

after training I do this:
model = prune.strip_pruning(model)
and the file I get is smaller (when compressed) so clearly weights have been set to 0.
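
Besides the compressed file size, the sparsity can be checked directly on the weight arrays. The following is a minimal sketch using NumPy only; `sparsity_report` is a hypothetical helper, and the arrays are stand-ins for what `layer.get_weights()` would return on a real Keras model:

```python
import numpy as np

def sparsity_report(named_weights):
    """Return {name: fraction of exactly-zero entries} for each weight array."""
    return {name: float(np.mean(w == 0)) for name, w in named_weights.items()}

# Stand-in kernel; with a real model the dict could be built from
# layer.get_weights() for each layer in model.layers.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 16, 32))

# Emulate magnitude pruning: zero the 90% of weights with smallest magnitude.
threshold = np.quantile(np.abs(kernel), 0.9)
pruned = np.where(np.abs(kernel) < threshold, 0.0, kernel)

report = sparsity_report({"conv/kernel": kernel, "conv_pruned/kernel": pruned})
for name, frac in report.items():
    print(f"{name}: {frac:.1%} zeros, shape {pruned.shape}")
```

Note that the pruned array has the same shape as the original: the zeros are stored and multiplied like any other value unless the runtime has dedicated sparse kernels.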

when I convert to tensorRT I:

freeze the model
run the code:

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=model_out,
    max_batch_size=1,
    max_workspace_size_bytes=max_work_space,
    precision_mode=tensorRT_precision,
    is_dynamic_op=False)

I count the number of engines:
len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
and the number of nodes:
len(trt_graph.node)
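
The two counts above can be extended into a per-op histogram, which shows which op types were left outside the TensorRT engines. This is a sketch under assumptions: `op_histogram` is a hypothetical helper, and the nodes here are stand-in objects; with TF-TRT one would pass `trt_graph.node` instead.

```python
from collections import Counter
from types import SimpleNamespace

def op_histogram(nodes):
    """Count nodes per op type; accepts any iterable of objects with an .op attribute."""
    return Counter(str(n.op) for n in nodes)

# Stand-in nodes for illustration only.
nodes = [SimpleNamespace(op=o)
         for o in ["Placeholder", "TRTEngineOp", "Conv2D", "Relu", "TRTEngineOp"]]

hist = op_histogram(nodes)
# Ops that were not folded into a TRTEngineOp remain as ordinary graph nodes.
non_trt = {op: count for op, count in hist.items() if op != "TRTEngineOp"}

print("engines:", hist["TRTEngineOp"], "total nodes:", len(nodes))
print("outside engines:", non_trt)
```

If the pruned and unpruned graphs produce the same histogram, the conversion has built the same graph structure for both and only the weight values differ.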

Is there anything else I should look at?

I haven’t had the chance to try the TLT toolkit yet.

Thanks

AastaLLL · June 22, 2020, 5:59am

Hi,

Another thing worth checking is how many layers are actually run with TensorRT.

Please note that the framework you are using is TF-TRT.
TF-TRT integrates TensorRT into the TensorFlow interface, so a layer may run with either TensorFlow or TensorRT.

The layer deployment can be found in the TensorFlow log.
Could you collect it and share it with us?

Thanks.

alex73 · June 22, 2020, 10:11am

@AastaLLL
Yes, using TF-TRT was probably the most comfortable approach.

Can you please share some links or code on how to test the above -

is there a way to know if a layer is being inferenced with TensorFlow?
what specifically am I looking for from the logs?

AastaLLL · July 8, 2020, 7:30am

Hi,

Sorry for the late update.
The layer placement log can be enabled with this command:

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Thanks.

alex73 · August 24, 2020, 4:18am

Hello @AastaLLL,

It took a while, but in continuation to this topic the situation is now this:

using the TLT there is an object detection Engine (based on an SSD mobilenet v2)
using the TLT the model was pruned (~half the original size)

Even though the model has been pruned, inference time remains - ~20-25[msec] on a TX2.

Is that the expected behaviour?
Should I prune to a much smaller size?
Is there anything I can verify (aside from file size and accuracy) in logs or anywhere, that pruning did work as expected with the end goal of having a faster inference time?

Thanks

AastaLLL · September 4, 2020, 5:44am

Hi,

1. Have you maximized the device performance first?
This can be done with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Please use fp16 mode to get better performance on TX2.

3. Do you have the original unpruned model?
If yes, it’s recommended to compare the performance of the pruned and unpruned models first.

Here are some performance results for DetectNet for your reference:
TX2’s performance should be in-between Nano and XavierNX.

Thanks.

alex73 · September 7, 2020, 10:28pm

@AastaLLL,
thanks for the reply -

I have maximized performance
the pruning was done via the TLT and naturally I compared original with pruned.

what level of pruning should I aim for? i.e. do I get a visible speed up at 30% (keep 30% of the original model) or do I need to go much further to 10% or similar?

Thanks

AastaLLL · September 21, 2020, 7:14am

Hi,

Sorry for the late reply.

This depends on your use case.
For example, some users expect 30 fps for their use case,
so they stop pruning once the performance meets their requirement.

This is a trade-off between accuracy and performance.
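
When checking each pruning level against a frame-rate target, it helps to measure latency the same way every time. The following is a minimal sketch, assuming a hypothetical `infer` callable that runs one blocking forward pass; `dummy_infer` below is only a placeholder workload, not a real model.

```python
import statistics
import time

def measure_latency_ms(infer, warmup=10, runs=50):
    """Median wall-clock latency of infer() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def fps_from_latency(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms

calls = []

def dummy_infer():
    # Placeholder workload standing in for one model forward pass.
    calls.append(1)
    sum(range(1000))

median_ms = measure_latency_ms(dummy_infer, warmup=2, runs=5)
print(f"median latency: {median_ms:.3f} ms")
print(f"27 ms per frame is roughly {fps_from_latency(27.0):.0f} fps")
```

The warm-up calls keep one-time costs such as engine building or memory allocation out of the measurement, and the median is less sensitive to occasional slow frames than the mean.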

Thanks.
