GTC 2020: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing

GTC 2020 S22459
Presenters: David Goodwin, NVIDIA; Dan Sun, Bloomberg
Abstract
Large-scale language models, such as BERT and GPT-2, have brought about exciting leaps in state-of-the-art accuracy for many NLP tasks. Due to its multi-head attention network, BERT requires significant compute during inference, which poses challenges for real-time application performance. KFServing provides model serving interfaces for common ML frameworks like TensorFlow, XGBoost, SKLearn, PyTorch, ONNX and NVIDIA’s TensorRT. Built on Kubernetes CRDs and Knative, KFServing enables hardware acceleration and autoscaling of Bloomberg’s own BERT models, which are trained on a corpus of specialized financial news data. We’ll discuss how the Bloomberg Data Science Platform uses KFServing to address latency and scalability in a production application. In addition to its scalability features, KFServing provides a standardized data plane across model frameworks and servers. We’ll also present the community proposal for a v2 REST/gRPC data plane, along with its integration in NVIDIA’s TensorRT Inference Server (TRTIS).
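
To make the idea of a standardized data plane concrete, here is a minimal sketch of what a v2 REST inference request might look like. It assumes the field names (`inputs`, `name`, `shape`, `datatype`, `data`) and the `/v2/models/<model>/infer` path from the v2 protocol as it was later published; the proposal presented in the session may differ in detail. The model name `bert-financial-news`, the tensor name `input_ids`, and both helper functions are hypothetical and exist only for illustration.

```python
import json


def build_v2_infer_request(name, shape, datatype, data):
    """Build a JSON-serializable body shaped like a v2 REST inference request.

    Assumption: the field names follow the published v2 protocol; they are
    not taken from the session itself.
    """
    # The flattened data must contain exactly prod(shape) elements.
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(
            f"shape {list(shape)} needs {expected} elements, got {len(data)}"
        )
    return {
        "inputs": [
            {
                "name": name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }


def infer_path(model_name):
    """Return the REST path for an inference call (assumed v2 layout)."""
    return f"/v2/models/{model_name}/infer"


# Hypothetical example: one sequence of four token IDs for a BERT-style model.
body = build_v2_infer_request("input_ids", [1, 4], "INT64", [101, 7592, 2088, 102])
print(infer_path("bert-financial-news"))
print(json.dumps(body))
```

Because the request describes tensors by name, shape, and datatype rather than in a framework-specific format, the same body could in principle be sent to any server that implements the protocol, which is the point of a common data plane.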
