TensorRT (float32) results are bad on GoogleNet

A TensorRT engine (float32) created from a GoogleNet model shows a large deviation in results from the Caffe model. Why does this happen, and how can it be fixed?

The experiment was run at float32 on several GPUs (1080 Ti, K80, M60), all of which show similarly poor results for a TensorRT engine created from GoogleNet.

I used the GPU REST Engine (https://github.com/NVIDIA/gpu-rest-engine, a REST API for Caffe using Docker and Go) to run TensorRT 2.1 on the GoogleNet ImageNet model.
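For reference, the TensorRT server builds its engine from the Caffe deploy files using the standard TensorRT 2.x Caffe-parser flow. Below is a minimal sketch of that flow, not the exact gpu-rest-engine code; the file names, batch size, workspace size, and the "prob" output blob name are assumptions based on the stock GoogleNet deploy prototxt:

```cpp
// Minimal sketch of building a float32 TensorRT engine from Caffe files.
// File names, batch size, workspace size, and the "prob" blob name are
// assumptions; the real gpu-rest-engine code may differ in details.
#include "NvInfer.h"
#include "NvCaffeParser.h"
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Logger required by the TensorRT builder; prints warnings and errors.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

ICudaEngine* buildGoogleNetEngine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // Parse the Caffe deploy prototxt and weights, keeping float32 weights.
    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobNameToTensor =
        parser->parse("deploy.prototxt", "googlenet.caffemodel",
                      *network, DataType::kFLOAT);

    // Mark the softmax blob as the engine output.
    network->markOutput(*blobNameToTensor->find("prob"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20); // 16 MB of scratch space

    ICudaEngine* engine = builder->buildCudaEngine(*network);

    network->destroy();
    parser->destroy();
    builder->destroy();
    return engine;
}
```

Note that no FP16 or INT8 options are set anywhere in this flow, so the engine should be running the entire network at float32.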

Dockerfile for the Caffe GoogleNet server:

Dockerfile for the TensorRT GoogleNet server:

Top-1 error on ImageNet validation set:
Caffe: 0.29386
TensorRT: 0.31972

Note that the TensorRT model shows a noticeable jump in error (about 2.6 percentage points absolute), which is very suspicious. Similar large deviations are observed with GoogleNet models trained on our own private data.

The same experiment conducted on ResNet-50 shows TensorRT performing exactly like the Caffe model; that is, the pipeline is working correctly for ResNet-50.

What is special about GoogleNet, and why does the TensorRT result deviate so much from the Caffe model? The deviation is so large and unexplained that TensorRT cannot be used in production for any GoogleNet model trained on other data. Is there anything I can do to get GoogleNet results identical, or at least close, to the Caffe model?
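One thing I can try to narrow this down is to mark an intermediate blob as an extra engine output and diff it against the same blob computed by Caffe on an identical preprocessed input. A sketch of that idea follows; the blob name "inception_3a/output" is taken from the stock GoogleNet prototxt and is an assumption:

```cpp
// Sketch: expose an intermediate GoogleNet blob as a second engine output so
// its values can be compared against the corresponding Caffe blob.
// "inception_3a/output" comes from the stock GoogleNet prototxt (assumption).
#include <cmath>
#include <cstdio>

// Added next to network->markOutput(*blobNameToTensor->find("prob")) in the
// build flow sketched above:
//
//     network->markOutput(*blobNameToTensor->find("inception_3a/output"));

// After running the same preprocessed image through both frameworks, report
// the worst element-wise difference between the two output buffers.
float maxAbsDiff(const float* trt, const float* caffe, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(trt[i] - caffe[i]));
    return worst;
}
```

Moving the extra output from the first convolution toward "prob" should show whether the error appears at one specific layer or accumulates gradually through the network.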

NVIDIA has shown GoogleNet in many of its TensorRT promotional slides and talks. Can NVIDIA share how the float32 TensorRT GoogleNet results were achieved?