TensorFlow performance improvement is uneven before/after jetson_clocks.sh

Hello, I’m using TensorFlow 1.7 with JetPack 3.2. I have two models: a small one (6.3MB) and a larger one about twice that size (13.6MB).

After running jetson_clocks.sh, the small model sped up from 0.125 s/f to 0.043 s/f, but the larger one only went from 0.332 s/f to 0.286 s/f.

  1. Why does one model speed up so much while the other does not?

Another interesting thing: when I reduce the input image to half size, the larger model drops from 0.286 s/f to 0.143 s/f, while the small one stays at around 0.040 s/f.

  2. So why doesn’t the small model’s execution time drop along with the input image size this time?
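To put the numbers above side by side, here is a minimal sketch that only computes the before/after ratios from the timings quoted in this post. The dictionary labels are my own naming and are not part of any tool output.

```python
# Timings in seconds per frame (before, after), copied from the post above.
timings = {
    "small model, jetson_clocks": (0.125, 0.043),
    "large model, jetson_clocks": (0.332, 0.286),
    "large model, half-size input": (0.286, 0.143),
    "small model, half-size input": (0.043, 0.040),
}


def speedup(before, after):
    """Return how many times faster 'after' is compared with 'before'."""
    return before / after


speedups = {name: speedup(b, a) for name, (b, a) in timings.items()}

for name, ratio in speedups.items():
    print("{}: {:.2f}x".format(name, ratio))
```

The ratios show the asymmetry directly: each change helps one model by roughly 2x to 3x and the other by well under 1.2x.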

How can I accelerate my models properly?

Any help would be appreciated, thanks.

Hi,

We need more information to give a dedicated suggestion.
Could you profile your model with nvprof and share the results with us?

sudo ./nvprof -o output.nvprof [your program]

Here are some common causes for your reference:
1. Some layers inside your model are not well suited to the GPU architecture.
2. The model is too small to gain performance from reducing the batch size.

Thanks.

Hello AastaLLL, thank you for the reply.

Here is the nvprof result with jetson_clocks.sh: https://drive.google.com/file/d/1MiFQhV9mnlR1oSveYT_OM9HZZRc34tH5/view?usp=sharing

Without jetson_clocks.sh, inferring one frame (100x500) with TensorFlow 1.7 takes 0.35 s; with jetson_clocks.sh it takes 0.286 s.

My model is purely ten 3x3 conv2d layers, with two fully connected layers at the end. The input is 100x500 and the output is also 100x500.

I cannot understand why my smaller model benefits so much from jetson_clocks.sh while the larger one does not.

Hoping for a reply.

Hi,

You can check the nvprof data with NVVP on the host.

Based on the result, your application spends most of its time in CUDA preparation and in memory allocation and freeing.
That is why the inference speedup is only slight in your use case.
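One way to check how much of the measured time is one-time setup is to time the first call separately from later calls. The following is a minimal sketch, not taken from your application: `time_steady_state` and `make_fake_workload` are hypothetical helper names, and the fake workload only simulates a slow first call. In a real TensorFlow 1.x script, `run_once` would presumably wrap your own `sess.run(...)` call.

```python
import time


def time_steady_state(run_once, warmup=5, iters=20):
    """Return (first_call_seconds, mean_steady_seconds) for run_once()."""
    # The first call usually pays one-time costs (initialization, allocation).
    start = time.perf_counter()
    run_once()
    first = time.perf_counter() - start

    # Remaining warm-up calls are executed but not timed.
    for _ in range(max(warmup - 1, 0)):
        run_once()

    # Steady state: average over several timed calls.
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    steady = (time.perf_counter() - start) / iters
    return first, steady


def make_fake_workload(setup_s=0.02, step_s=0.001):
    """Hypothetical stand-in for inference: slow first call, fast afterwards."""
    state = {"initialized": False}

    def run_once():
        if not state["initialized"]:
            time.sleep(setup_s)  # simulated one-time setup cost
            state["initialized"] = True
        time.sleep(step_s)  # simulated per-frame work

    return run_once


first, steady = time_steady_state(make_fake_workload(), warmup=2, iters=5)
print("first call: {:.4f} s, steady state: {:.4f} s/frame".format(first, steady))
```

If the steady-state figure is much lower than the first call, the per-frame numbers should be taken from the steady-state loop only.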

Thanks.