Replicate 2.2ms inference time on BERT

This topic is a duplicate; it was originally posted in the wrong category. The new topic is here: https://devtalk.nvidia.com/default/topic/1061766/tensorrt/replicate-2-2ms-inference-time-on-bert/

Please delete this topic if possible.