Real-Time Natural Language Processing with BERT Using NVIDIA TensorRT (Updated)


Today, NVIDIA is releasing TensorRT 8.0, which introduces many transformer optimizations. With this post update, we present the latest TensorRT-optimized BERT sample and its inference latency benchmark on A30 GPUs. Using the optimized sample, you can run BERT-Base or BERT-Large inference at different batch sizes within the 10-ms latency budget for conversational AI applications.
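To give a concrete sense of how such a latency number can be measured, here is a minimal sketch that times a serialized TensorRT engine through the TensorRT Python API. It assumes a fixed-shape BERT engine already built with the sample; the engine filename, iteration counts, and dummy zero-filled inputs are illustrative placeholders rather than part of the sample itself.

```python
# Minimal latency-benchmark sketch for a serialized TensorRT BERT engine.
# Assumes a fixed-shape engine (e.g. built for batch_size x seq_len inputs).
import time

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def benchmark(engine_path: str, iterations: int = 100) -> float:
    """Return the average per-inference latency in milliseconds."""
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Allocate one device buffer per binding; fill inputs with dummy zeros.
    bindings = []
    for i in range(engine.num_bindings):
        shape = engine.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_buf = np.zeros(trt.volume(shape), dtype=dtype)
        dev_buf = cuda.mem_alloc(host_buf.nbytes)
        if engine.binding_is_input(i):
            cuda.memcpy_htod(dev_buf, host_buf)
        bindings.append(int(dev_buf))

    # Warm up, then time synchronous executions.
    for _ in range(10):
        context.execute_v2(bindings)
    start = time.perf_counter()
    for _ in range(iterations):
        context.execute_v2(bindings)
    return (time.perf_counter() - start) / iterations * 1e3


if __name__ == "__main__":
    # "bert_large_b1_s128.engine" is a hypothetical filename; substitute the
    # engine you actually built with the TensorRT BERT sample.
    print(f"avg latency: {benchmark('bert_large_b1_s128.engine'):.2f} ms")
```

Because `execute_v2` runs synchronously, wall-clock timing around the call directly reflects per-inference latency; repeating the measurement at different batch sizes shows how far each configuration stays under the 10-ms budget.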