Why do TensorRT plan files generated on TX2 from the same Caffe deploy file and weight file run at very different speeds?

I found that on TX2, TensorRT plan files generated from the same Caffe deploy file and weight file run at very different speeds. The gap is especially large in fp16 mode: the fastest plan file runs at 18.4 ms/image, while the slowest runs at 27 ms/image. My detection model is an SSD variant built only with TensorRT built-in layers, without any custom layers. Has anyone else encountered a similar problem?

I use JetPack 3.3 (TensorRT 4.0.1.6 + Ubuntu 16.04 + CUDA 9.0 + cuDNN 7.1.5).

========================
/usr/src/tensorrt/bin/trtexec --output=confidence --output=bboxes --hostTime --fp16 --avgRuns=100 --iterations=6 --deploy=$HOME/data/detector_gray_960x540.prototxt --model=$HOME/data/detector_gray_960x540.caffemodel --engine=$HOME/models/fp16_$counter.eng

Generating plan file 1
output: confidence
output: bboxes
hostTime
fp16
avgRuns: 100
iterations: 6
deploy: detector_gray_960x540.prototxt
model: detector_gray_960x540.caffemodel
engine: /home/ubuntu/models/fp16_1.eng
Input "data": 1x540x960
Output "confidence": 1x21828x3
Output "bboxes": 1x21828x4
name=data, bindingIndex=0, buffers.size()=3
name=confidence, bindingIndex=1, buffers.size()=3
name=bboxes, bindingIndex=2, buffers.size()=3
Average over 100 runs is 18.3816 ms
Average over 100 runs is 18.3787 ms

Generating plan file 5
output: confidence
output: bboxes
hostTime
fp16
avgRuns: 100
iterations: 6
deploy: detector_gray_960x540.prototxt
model: detector_gray_960x540.caffemodel
engine: /home/ubuntu/models/fp16_5.eng
Input "data": 1x540x960
Output "confidence": 1x21828x3
Output "bboxes": 1x21828x4
name=data, bindingIndex=0, buffers.size()=3
name=confidence, bindingIndex=1, buffers.size()=3
name=bboxes, bindingIndex=2, buffers.size()=3
Average over 100 runs is 27.0344 ms
Average over 100 runs is 27.0445 ms

Hi,

Have you maximized the CPU/GPU clock before measuring the performance?

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh
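(You can confirm the current power mode afterwards with: sudo nvpmodel -q)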

Thanks.

Thanks a lot for your reply. The above results were measured without maximizing the CPU/GPU clocks. After maximizing them, the speed is much faster. But what bothers me most is why plan files optimized from the same deploy file and weight file run at such different speeds.
As shown below, these two plan files were generated with the same command and the same weight file, yet one is 7.5 ms faster than the other after maximizing the CPU/GPU clocks, and 8.7 ms faster before maximizing them.

$ /usr/src/tensorrt/bin/trtexec --output=confidence --output=bboxes --hostTime --output=confidence --output=bboxes --avgRuns=100 --iterations=6 --engine=./fp16_5.eng
output: confidence
output: bboxes
hostTime
output: confidence
output: bboxes
avgRuns: 100
iterations: 6
engine: ./fp16_5.eng
name=data, bindingIndex=0, buffers.size()=5
name=confidence, bindingIndex=1, buffers.size()=5
name=bboxes, bindingIndex=2, buffers.size()=5
name=confidence, bindingIndex=1, buffers.size()=5
name=bboxes, bindingIndex=2, buffers.size()=5
Average over 100 runs is 23.3359 ms
Average over 100 runs is 23.3098 ms
Average over 100 runs is 23.2954 ms
Average over 100 runs is 23.3333 ms
Average over 100 runs is 23.3341 ms
Average over 100 runs is 23.2938 ms

$ /usr/src/tensorrt/bin/trtexec --output=confidence --output=bboxes --hostTime --output=confidence --output=bboxes --avgRuns=100 --iterations=6 --engine=./fp16_1.eng
output: confidence
output: bboxes
hostTime
output: confidence
output: bboxes
avgRuns: 100
iterations: 6
engine: ./fp16_1.eng
name=data, bindingIndex=0, buffers.size()=5
name=confidence, bindingIndex=1, buffers.size()=5
name=bboxes, bindingIndex=2, buffers.size()=5
name=confidence, bindingIndex=1, buffers.size()=5
name=bboxes, bindingIndex=2, buffers.size()=5
Average over 100 runs is 15.8488 ms
Average over 100 runs is 15.8245 ms
Average over 100 runs is 15.8264 ms
Average over 100 runs is 15.8091 ms
Average over 100 runs is 15.8143 ms
Average over 100 runs is 15.8326 ms
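
A likely factor here: when building an engine, the TensorRT builder times the candidate kernels for each layer on the device and keeps the fastest one, so timing noise at build time (unlocked clocks, background load) can steer it toward different kernels and yield plan files with different speeds. When building through the C++ API instead of trtexec, the builder can be asked to take more timing samples per kernel, which should make this selection less noisy. Below is a minimal sketch under that assumption, using the TensorRT 4-era Caffe parser and builder API (exact method names may differ slightly between versions); the file names match those above, and the workspace size and find-iteration counts are illustrative values, not recommendations.

#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <fstream>
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal logger required by the TensorRT API.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Error checking is omitted for brevity.
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();

    // Same deploy/weight files as in the trtexec runs above.
    const IBlobNameToTensor* blobs = parser->parse(
        "detector_gray_960x540.prototxt",
        "detector_gray_960x540.caffemodel",
        *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("confidence"));
    network->markOutput(*blobs->find("bboxes"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(256 << 20); // 256 MB, illustrative
    builder->setFp16Mode(true);

    // Take more timing samples per candidate kernel so the
    // fastest-kernel selection is less sensitive to timing noise.
    // The counts here are illustrative.
    builder->setMinFindIterations(4);
    builder->setAverageFindIterations(4);

    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Serialize the optimized engine to a plan file.
    IHostMemory* plan = engine->serialize();
    std::ofstream out("fp16.eng", std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());

    plan->destroy();
    engine->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    shutdownProtobufLibrary();
    return 0;
}

Note that even with more find iterations, TensorRT does not guarantee that two builds pick the same kernels, so it may still be worth building several engines with the clocks maximized and keeping the fastest one.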

Hi,

Could you share these two models with us for investigation?

Thanks.