GTC 2020: Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU

GTC 2020 S21736
Presenter: Tianhao Xu, NVIDIA
We'll give an overview of the TensorRT Hyperscale Inference Platform. We'll start with a deep dive into current features and internal architecture, then cover deployment options in a generic deployment ecosystem. Next, we'll give a hands-on overview of NVIDIA BERT, FasterTransformer, and TensorRT-optimized BERT inference. Then we'll cover how to deploy a BERT TensorFlow model with a custom op, how to deploy a BERT TensorRT model with plugins, and benchmarking. We'll finish with other optimization techniques and open discussion.


I saw this document and am trying to run BERT on Triton Inference Server. Everything goes fine, and I can even run bert_fastertransformer correctly, but I get an error only with the bert_trt model:
./install/bin/perf_client -m bert_trt -d -x32 -c8 -l200 -p2000 -b32 -i grpc -u -t1 --max-threads=1
I followed every step in the document, so why does this happen?
The error message from the server is:
I0416 10:01:21.706121 60] GRPC allocation failed for type 1 for cls_squad_logits
I0416 10:01:21.706179 60] GRPC allocation: cls_squad_logits, size 1024, addr 0x7f858e45b0b0
I0416 10:01:21.730079 60] Infer failed: failed to use CUDA copy for tensor 'cls_squad_logits': an illegal memory access was encountered
I0416 10:01:21.730093 60] InferHandler::InferComplete, 3 step ISSUED
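For context while debugging, a TensorRT model such as bert_trt is described to Triton by a `config.pbtxt` in the model repository. Two things worth double-checking against that file: the `max_batch_size` must be at least the `-b` value passed to perf_client (requests larger than what the engine was built for can fail with memory errors), and a BERT engine built with TensorRT plugins requires the server process to have the plugin library loaded (e.g. preloaded at startup). The sketch below is a hypothetical configuration, not taken from the session document; every name and dimension except `cls_squad_logits` (which appears in the error log) is an assumption:

```protobuf
# Hypothetical config.pbtxt for the bert_trt model.
# Input names, dims, and data types are assumptions for illustration;
# only the output name cls_squad_logits comes from the error log above.
name: "bert_trt"
platform: "tensorrt_plan"
max_batch_size: 32        # must be >= the -b value used with perf_client
input [
  {
    name: "input_ids"     # assumed input name
    data_type: TYPE_INT32
    dims: [ 384 ]         # assumed max sequence length
  },
  {
    name: "segment_ids"   # assumed input name
    data_type: TYPE_INT32
    dims: [ 384 ]
  },
  {
    name: "input_mask"    # assumed input name
    data_type: TYPE_INT32
    dims: [ 384 ]
  }
]
output [
  {
    name: "cls_squad_logits"
    data_type: TYPE_FP32
    dims: [ 384, 2, 1, 1 ]   # assumed shape for SQuAD start/end logits
  }
]
```

If the shapes or `max_batch_size` here disagree with the engine that was actually built, mismatched requests are one plausible source of the illegal-memory-access failure above.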