I’m trying to use TF-TRT to accelerate inference on BERT. Conversion and calibration work fine, and inference is fast for batches that don’t require building a new TRT engine. However, I observed an extremely long engine-build time (~100 s) and high H2D memcpy traffic (~9 GB), even though my model has only ~50 million parameters (so the weights should be about 200 MB in bytes).

According to nvprof, a CPU-to-GPU memcpy happens every 4 cuBLAS batched gemms (see the screenshot below; each blue or cyan block corresponds to one gemm, which is part of a (2048,512) * (512,512) matrix multiplication for a fully connected layer). Each copy transfers 1 MB, which I believe is the weight matrix of the fully connected layer (512*512*4 / 1024 / 1024 = 1 MB). There are even some repeated memcpys not followed by any gemm computation.
I’m wondering why TF-TRT performs these redundant memcpys, one for every 4 gemms, and sometimes with no gemm following at all.
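To make the numbers above concrete, here is a quick sanity check of the sizes involved. All figures come from my observations; the 4 bytes per parameter is an assumption (fp32 weights):

```python
# Back-of-the-envelope check of the sizes mentioned above.
BYTES_PER_PARAM = 4  # assuming fp32 weights

# One fully connected layer: a (512, 512) weight matrix
fc_weight_bytes = 512 * 512 * BYTES_PER_PARAM
print(f"FC weight: {fc_weight_bytes / 2**20:.0f} MiB")  # matches the 1 MB per H2D copy

# Whole model: ~50 million parameters
model_bytes = 50_000_000 * BYTES_PER_PARAM
print(f"Model weights: {model_bytes / 1e6:.0f} MB")  # the ~200 MB figure

# Observed H2D traffic during engine build: ~9 GB,
# i.e. roughly 45x the total weight size
observed_h2d_bytes = 9e9
print(f"H2D traffic is ~{observed_h2d_bytes / model_bytes:.0f}x the weight size")
```

So the engine build transfers the equivalent of the full weight set dozens of times over, which is why the repeated per-layer copies look redundant to me.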
TensorRT Version: 7.1.3
GPU Type: Tesla V100 32GB
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 8.0.2
Operating System + Version: RHEL 7
Python Version (if applicable): 3.7
TensorFlow Version (if applicable): 2.4.0 (built from source)