GTC 2020: Faster Transformer

GTC 2020 S21417
Presenter: Bo Yang Hsueh, NVIDIA
Abstract
Recently, models such as BERT and XLNet, which adopt a stack of transformer layers as key components, have shown breakthrough performance in various deep learning tasks. However, these models are computationally expensive, so the inference performance of the transformer layer greatly limits the possibility that they can be adopted in online services. First, we’ll show how Faster Transformer optimizes the inference computation of both the transformer encoder and decoder layers. In addition to optimizations on the standard transformer, we’ll get into how to customize Faster Transformer to accelerate a pruned transformer encoder layer together with the CUTLASS library.
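The abstract mentions building on the CUTLASS library for the pruned encoder layer. As background, the sketch below shows a plain FP32 GEMM launched through CUTLASS's device-level API, the kind of building block such a customization would start from. It is only an illustration of CUTLASS usage, not the talk's actual code; the function name, matrix layouts, and parameters are assumptions for the example.

```cpp
// Illustrative only: a single row-major FP32 GEMM via CUTLASS's device API.
// Computes D = alpha * A(MxK) * B(KxN) + beta * C(MxN), writing D over C.
#include <cutlass/gemm/device/gemm.h>

using RowMajor = cutlass::layout::RowMajor;
using Gemm = cutlass::gemm::device::Gemm<
    float, RowMajor,   // element type and layout of A
    float, RowMajor,   // element type and layout of B
    float, RowMajor>;  // element type and layout of C / D

// A, B, C are device pointers already filled on the GPU (hypothetical helper).
bool run_gemm(int M, int N, int K,
              const float* A, const float* B, float* C,
              float alpha = 1.0f, float beta = 0.0f) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},      // problem size
                       {A, K},         // A with leading dimension K
                       {B, N},         // B with leading dimension N
                       {C, N},         // C (source)
                       {C, N},         // D (destination, in place)
                       {alpha, beta}); // epilogue scalars
  return gemm_op(args) == cutlass::Status::kSuccess;
}
```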
