Optimizing Inference Performance for Transformer-based Models on Nvidia GPUs

I’m working on a natural language processing project using Transformer-based models (specifically, BERT and its variants), and inference is actually slower on my Nvidia GPU than on the CPU. I’ve tried TensorRT, including optimizing the model with Nvidia’s TensorRT tooling, but I’m still not seeing the performance gains I expect.
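
For reference, my conversion path looks roughly like the sketch below. It is simplified: the checkpoint name, sequence length, and file names are placeholders, and the trtexec step is run from inside the NGC PyTorch container.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint -- my real model is a fine-tuned BERT variant.
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dummy batch used only to trace the graph for ONNX export.
dummy = tokenizer(
    ["example input"], padding="max_length", max_length=128, return_tensors="pt"
)

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}},
    opset_version=17,
)

# Engine build from the exported ONNX file, FP16 enabled:
#   trtexec --onnx=bert.onnx --saveEngine=bert.plan --fp16
```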

Specifically:

  • How can I further optimize my model and inference pipeline to take advantage of Nvidia GPU acceleration?

  • Are there any specific techniques or best practices for optimizing Transformer-based models on Nvidia GPUs that I’m missing?

  • Are there any upcoming features or updates in Nvidia’s AI software stack that will improve performance for Transformer-based models?

Additional context:

  • I’m using Python, PyTorch, and the Nvidia GPU Cloud (NGC) containers for my development environment.

  • My dataset is relatively small (~100k samples), and I’m using a single Nvidia V100 GPU for inference (my timing setup is sketched after this list).
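
This is roughly how I’m timing the GPU side; the CPU numbers come from the same loop with the device set to "cpu". Batch size and sequence length here are illustrative, not my exact values.

```python
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder for my fine-tuned checkpoint
device = torch.device("cuda")
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One padded batch reused for every iteration, so only model time is measured.
batch = tokenizer(
    ["example input"] * 32, padding="max_length", max_length=128, return_tensors="pt"
).to(device)

with torch.no_grad():
    # Warm-up so one-time CUDA/cuDNN initialization is not included in the timing.
    for _ in range(10):
        model(**batch)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(**batch)
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

print(f"{elapsed / 100 * 1000:.2f} ms per batch of 32")
```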

Goal:

I’m aiming for at least a 2x inference speedup on the Nvidia GPU over the CPU, and I’m hoping the community can provide guidance on how to get there.