Optimizing Recurrent Neural Networks in cuDNN 5

Originally published at: https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/

Figure 1: cuDNN 5 + Torch speedup vs. Torch-rnn implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, input size 2560, 1 layer, Seq length 200, batch size 64. Network B: RNN size 256, input size 64, 3 layers, batch size 64. Network C: RNN size 256, input size 256, 1 layer, batch…

About optimization 2, shouldn't the hardware scheduler take care of that bit, assuming that the two SGEMMs have outputs in different locations?

Also, just recently researchers managed to get batch normalization to work for recurrent nets, so I am wondering whether it would be possible to combine BN with the above optimized implementations?

Probably not as is, but since all of the optimizations above work on the linear parts, it might be more worthwhile to make specialized linear-layer functions rather than recurrent ones. Then, if those linear-layer functions could be extended to use convolutions, it might enable more efficient use of architectures such as Neural GPUs.

Also, cuDNN currently lacks a broadcast matrix-vector multiplication operation such as the one that is surely used inside the BN functions. Another weakness of the above arrangement, and a further argument in favor of having functions specifically for linear layers, is that such functions would allow one to use activations such as a steep sigmoid or PReLU.

> About optimization 2, shouldn't the hardware scheduler take care of that bit, assuming that the two SGEMMs have outputs in different locations?

Without additional information it's very hard, if not impossible, to prove that two calls can be executed in parallel, so different calls won't be executed concurrently unless the user explicitly allows them to be via streams.
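
To make the stream opt-in concrete, here's a minimal CUDA C++ sketch (the buffer names and sizes are hypothetical, and the cuBLAS handle is assumed to already exist) of issuing two independent SGEMMs on separate streams so the hardware is allowed to overlap them:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Two independent n x n SGEMMs, each issued on its own stream. On a single
// stream they would serialize even though their outputs don't overlap.
void two_concurrent_sgemms(cublasHandle_t handle,
                           const float* A1, const float* B1, float* C1,
                           const float* A2, const float* B2, float* C2,
                           int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const float alpha = 1.0f, beta = 0.0f;

    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A1, n, B1, n, &beta, C1, n);

    cublasSetStream(handle, s2);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A2, n, B2, n, &beta, C2, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cublasSetStream(handle, 0);  // restore the default stream on the handle
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

By putting the calls on different streams the caller is asserting there is no dependency between them; that assertion is exactly the "additional information" the scheduler can't infer on its own.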

> Also, just recently researchers managed to get batch normalization
to work for recurrent nets, so I am wondering whether it would be
possible to combine BN with the above optimized implementations?

Thanks for the link. This (and your other comments) points to interesting directions. There are a lot of different ways you could modify RNNs, and covering them all while keeping memory usage low, performance high, and the API easy to use is tricky. The optimization strategies I've described in this blog post are all possible to implement in CUDA, and modifications to your favourite framework to allow some or all of them would hopefully lead to similar speedups for custom RNNs.

Thanks Jeremy, great post. It never occurred to me to batch dot products together in this way.

I'm confused. This article's opener mentions the new native support for RNNs in cuDNN v5, yet nowhere in the article, and nowhere in the associated code samples on GitHub, do I find any use of the new tensor descriptors, RNN descriptors, or any of the new RNN-related API calls. In fact, I don't see anything in your code that demonstrates any of the new RNN features of cuDNN v5. Please correct me if I've missed something.

Nice post Jeremy! I have a question about Step 1 Optimization 2 in which independent matrix multiplications within an LSTM are enqueued into different streams. This option is not available in cuDNN (v 5 as of now), correct?

There are corner cases where only one stream is possible. Excluding those, cuDNN v5 should be using this optimization under the hood.

But the cuDNN documentation states that cudnnSetStream sets the single stream on which all subsequent cuDNN calls execute. Does cuDNN create and use implicitly defined private streams for this purpose?

That's correct. These streams will not start to execute before all prior work is completed on the stream set by cudnnSetStream, nor will work scheduled to that stream begin while the private streams are still executing. In other words, from the caller perspective, behaviour is the same as if there were no private streams.
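
For readers curious what that guarantee can look like in practice, here's an illustrative fork/join pattern using CUDA events. This is only a sketch of how a library could fence private streams against a caller's stream, not cuDNN's actual internal code:

```cpp
#include <cuda_runtime.h>

// Work enqueued on the private streams waits for everything already queued on
// the user's stream (fork), and the user's stream then waits for the private
// streams to finish (join). To the caller it looks like one ordered stream.
void fork_join_on_private_streams(cudaStream_t user_stream,
                                  cudaStream_t priv_a, cudaStream_t priv_b)
{
    cudaEvent_t fork, join_a, join_b;
    cudaEventCreateWithFlags(&fork,   cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join_a, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join_b, cudaEventDisableTiming);

    // Fork: private streams wait on all prior work in user_stream.
    cudaEventRecord(fork, user_stream);
    cudaStreamWaitEvent(priv_a, fork, 0);
    cudaStreamWaitEvent(priv_b, fork, 0);

    // ... enqueue the independent GEMMs/kernels on priv_a and priv_b here ...

    // Join: user_stream waits until both private streams are done.
    cudaEventRecord(join_a, priv_a);
    cudaEventRecord(join_b, priv_b);
    cudaStreamWaitEvent(user_stream, join_a, 0);
    cudaStreamWaitEvent(user_stream, join_b, 0);

    cudaEventDestroy(fork);
    cudaEventDestroy(join_a);
    cudaEventDestroy(join_b);
}
```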

Very nice post. Thanks for sharing the optimizations.

Is it possible to share the dimensions of each matrix in a single SGEMM call? I am assuming that out of the 8 SGEMM calls there are 2 groups of 4 SGEMMs, and all SGEMMs in a group have the same input matrix dimensions. I am trying to understand the calculations and how large the matrices are. Thanks.

Typically each of the eight matrices is square, with side length equal to the hidden state size of the model. In the first layer the non-recurrent input may have a different dimension, in which case rectangular matrices are needed for those four operations. The hidden state size and input size are user defined and typically lie in the 256–2048 range, though larger and smaller values are also sometimes used.
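
To make those shapes concrete, here's a small worked example with assumed sizes (H = 512, I = 512, N = 64 are illustrative choices, not values taken from the post):

```cpp
#include <cstdio>

int main() {
    const int H = 512;  // hidden state size (assumed for illustration)
    const int I = 512;  // input size; equals H except possibly in layer 1 (assumed)
    const int N = 64;   // minibatch size (assumed)

    // Each of the 4 recurrent weight matrices is H x H; each of the 4 input
    // weight matrices is H x I. Multiplying by the H x N hidden-state
    // minibatch (or the I x N input minibatch) gives H x N gate pre-activations.
    std::printf("recurrent GEMM per gate: (%d x %d) * (%d x %d) -> (%d x %d)\n",
                H, H, H, N, H, N);
    std::printf("input     GEMM per gate: (%d x %d) * (%d x %d) -> (%d x %d)\n",
                H, I, I, N, H, N);

    // Grouping the four recurrent GEMMs into one call, as the post describes,
    // gives a single larger GEMM:
    std::printf("grouped recurrent GEMM : (%d x %d) * (%d x %d) -> (%d x %d)\n",
                4 * H, H, H, N, 4 * H, N);
    return 0;
}
```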

Thanks. That makes sense to a large extent. I will look closely into more calculations to understand it completely.

Also, is there any framework that takes advantage of these optimizations? I think the DNN frameworks need to be updated so that they can benefit from them. Please let me know if a framework with this implementation is available.

It's ok, the article is about optimisation, not about the new tensor descriptors.

Most major frameworks have ways to interact with the cuDNN RNN API (e.g. TensorFlow, PyTorch, Caffe2, and more).

Thanks for the great tutorial.

I am trying to understand better what "fusing element-wise operation" means. Please let me know if the following is correct:

Suppose I want to compute ReLU(A*B). Without fusing the pointwise operation, I launch a GEMM kernel to compute A*B, and when the kernel finishes it writes the product C = A*B back to global memory. Then I launch a second kernel to compute ReLU(C), which has to fetch C from global memory into on-chip memory and threshold all of its entries. Obviously, in this last step almost all the time is spent fetching C from global memory, and the goal of fusing is to eliminate this unnecessary memory traffic.

In the "fusing" scenario, we launch a single GEMM kernel with an extra line at the end of the kernel code that thresholds each entry of C as it becomes available.

Did I understand correctly what "fusing element-wise operation" means?

Thanks

> Did I understand correctly what "fusing element-wise operation" means?

Yes. In this case the optimizations are not fusing operations with the GEMM kernel, but instead implementing all of the tanh/sigmoid/addition/multiplication operations as a single kernel rather than as individual kernels.
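
As an illustration of what that single kernel might look like, here's a minimal sketch of a fused LSTM point-wise stage in CUDA. It's my own simplified version rather than cuDNN's implementation, and it assumes the gate pre-activations already contain the summed GEMM outputs and biases:

```cpp
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float sigmoidf(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// All sigmoid/tanh/multiply/add operations for one LSTM cell update happen in
// a single kernel: each pre-activation is read from global memory once, the
// intermediate values stay in registers, and only c and h are written back.
__global__ void lstm_pointwise_fused(const float* i_pre, const float* f_pre,
                                     const float* g_pre, const float* o_pre,
                                     const float* c_old,
                                     float* c_new, float* h_new, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float i = sigmoidf(i_pre[idx]);   // input gate
    float f = sigmoidf(f_pre[idx]);   // forget gate
    float g = tanhf(g_pre[idx]);      // candidate cell update
    float o = sigmoidf(o_pre[idx]);   // output gate

    float c = f * c_old[idx] + i * g; // new cell state
    c_new[idx] = c;
    h_new[idx] = o * tanhf(c);        // new hidden state
}
```

Each element makes one round trip through global memory for the whole point-wise stage rather than one per operation, which is where the savings come from.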

Fusing these operations with GEMM kernels isn't out of the question. In this recent post on CUTLASS: https://devblogs.nvidia.com... there is an example showing how a GEMM can be fused with a bias and ReLU activation function.

Thank you for this excellent blog. I have a question about Optimization 5. I think that in every sub-iteration, when we finish the computation in stream B, we need to perform the point-wise computation, because the next iteration requires the outputs from the last iteration. I don't know if my idea is right, so any answer would be welcome. Thank you.