GTC 2020: XLNet Optimization Using CUDA

GTC 2020 S21478
Presenters: Christina Zhang, NVIDIA
Abstract
XLNet, a generalized autoregressive pretraining method, has achieved strong results on several natural language processing tasks. Compared with previous language models, XLNet can process long sentences and avoids the drawbacks of relying on special tokens. However, as far as we know, XLNet has not yet been properly optimized for performance using CUDA, which lengthens inference time and hinders its wide deployment. We first profiled XLNet using its TensorFlow code. We then optimized XLNet in three areas:

  1. For relative positional encoding, we improved its parallelization with the help of cuBLAS;
  2. We customized the corresponding self-attention architecture based on the attention code in FasterTransformer; and
  3. We used kernel fusion and other CUDA optimization strategies to speed up XLNet.
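
The idea behind the kernel fusion mentioned in item 3 can be illustrated with a small CPU-side analogue in plain C++. This is only a sketch under stated assumptions, not code from the session: the abstract does not say which operations were fused, so a bias add followed by a residual add is assumed here, and the function names are hypothetical. On a GPU, each loop in the unfused version would roughly correspond to a separate kernel launch with an intermediate buffer written to and read back from global memory, while the fused version makes a single pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused version (hypothetical name): two passes over the data.
// The first pass writes an intermediate buffer, the second reads it back.
std::vector<float> bias_residual_unfused(const std::vector<float>& x,
                                         const std::vector<float>& bias,
                                         const std::vector<float>& residual,
                                         std::size_t hidden) {
    assert(hidden > 0 && bias.size() == hidden);
    assert(x.size() == residual.size() && x.size() % hidden == 0);

    // Pass 1: add the per-column bias (analogous to one kernel launch).
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        tmp[i] = x[i] + bias[i % hidden];
    }

    // Pass 2: add the residual (analogous to a second kernel launch).
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = tmp[i] + residual[i];
    }
    return out;
}

// Fused version (hypothetical name): one pass, no intermediate buffer.
// Each element is read once, both additions are applied, and the result
// is written once.
std::vector<float> bias_residual_fused(const std::vector<float>& x,
                                       const std::vector<float>& bias,
                                       const std::vector<float>& residual,
                                       std::size_t hidden) {
    assert(hidden > 0 && bias.size() == hidden);
    assert(x.size() == residual.size() && x.size() % hidden == 0);

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (x[i] + bias[i % hidden]) + residual[i];
    }
    return out;
}
```

Both functions apply the same additions in the same order, so they return identical results; the fused form simply avoids the extra memory traffic, which is typically the main benefit of fusing memory-bound element-wise operations.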
