Optimizing Recurrent Neural Networks in cuDNN 5

Originally published at: https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/

Figure 1: cuDNN 5 + Torch speedup vs. Torch-rnn implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, input size 2560, 1 layer, Seq length 200, batch size 64. Network B: RNN size 256, input size 64, 3 layers, batch size 64. Network C: RNN size 256, input size 256, 1 layer, batch…

About optimization 2, shouldn't the hardware scheduler take care of that bit, assuming that the two SGEMMs have outputs in different locations?

Also, just recently researchers managed to get batch normalization to work for recurrent nets, so I am wondering whether it would be possible to combine BN with the above optimized implementations?

Probably not as is, but since all of the optimizations above work on the linear parts, it might be more worthwhile to make specialized linear-layer functions rather than recurrent ones. Then, if those linear-layer functions could be extended to use convolutions, it might enable more efficient use of architectures such as Neural GPUs.

Also, cuDNN currently lacks a broadcast matrix-vector multiplication operation such as the one that is surely used inside the BN functions. Another weakness of the above arrangement, and a further argument in favor of having functions specifically for linear layers, is that such functions would allow one to use activations such as a steep sigmoid or PReLU.

> About optimization 2, shouldn't the hardware scheduler take care of that bit, assuming that the two SGEMMs have outputs in different locations?

Without additional information it's very hard, if not impossible, to prove that two calls can be executed in parallel, so different calls won't be executed concurrently unless the user explicitly allows them to be via streams.
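
To make the stream opt-in concrete, here's a minimal CUDA C++ sketch (the buffer names and sizes are hypothetical, and the cuBLAS handle is assumed to already exist) of issuing two independent SGEMMs on separate streams so the hardware is allowed to overlap them:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Two independent n x n SGEMMs, each issued on its own stream. On a single
// stream they would serialize even though their outputs don't overlap.
void two_concurrent_sgemms(cublasHandle_t handle,
                           const float* A1, const float* B1, float* C1,
                           const float* A2, const float* B2, float* C2,
                           int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const float alpha = 1.0f, beta = 0.0f;

    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A1, n, B1, n, &beta, C1, n);

    cublasSetStream(handle, s2);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A2, n, B2, n, &beta, C2, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cublasSetStream(handle, 0);  // restore the default stream on the handle
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

By putting the calls on different streams the caller is asserting there is no dependency between them; that assertion is exactly the "additional information" the scheduler can't infer on its own.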

> Also, just recently researchers managed to get batch normalization
to work for recurrent nets, so I am wondering whether it would be
possible to combine BN with the above optimized implementations?

Thanks for the link. This (and your other comments) points to interesting directions. There are a lot of different ways you could modify RNNs, and covering them all while keeping memory usage low, performance high, and the API easy to use is tricky. The optimization strategies I've described in this blog post are all possible to implement in CUDA, and modifications to your favourite framework to allow some or all of them would hopefully lead to similar speedups for custom RNNs.

Thanks Jeremy, great post. It never occurred to me to batch dot products together in this way.

I'm confused. This article's opener mentions the new native support for RNNs in cuDNN v5, yet nowhere in the article, and nowhere in the associated code samples on GitHub, do I find any use of the new tensor descriptors, RNN descriptors, or any of the new RNN-related API calls. In fact, I don't see anything in your code that demonstrates any of the new RNN features of cuDNN v5. Please correct me if I've missed something.

Nice post Jeremy! I have a question about Step 1 Optimization 2 in which independent matrix multiplications within an LSTM are enqueued into different streams. This option is not available in cuDNN (v 5 as of now), correct?

There are corner cases where only one stream is possible. Excluding those, cuDNN v5 should be using this optimization under the hood.

But the cuDNN documentation states that cudnnSetStream sets the single stream on which all subsequent cuDNN calls execute. Does cuDNN create and use implicitly defined private streams for this purpose?

That's correct. These streams will not start to execute before all prior work is completed on the stream set by cudnnSetStream, nor will work scheduled to that stream begin while the private streams are still executing. In other words, from the caller perspective, behaviour is the same as if there were no private streams.
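
For readers curious what that guarantee can look like in practice, here's an illustrative fork/join pattern using CUDA events. This is only a sketch of how a library could fence private streams against a caller's stream, not cuDNN's actual internal code:

```cpp
#include <cuda_runtime.h>

// Work enqueued on the private streams waits for everything already queued on
// the user's stream (fork), and the user's stream then waits for the private
// streams to finish (join). To the caller it looks like one ordered stream.
void fork_join_on_private_streams(cudaStream_t user_stream,
                                  cudaStream_t priv_a, cudaStream_t priv_b)
{
    cudaEvent_t fork, join_a, join_b;
    cudaEventCreateWithFlags(&fork,   cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join_a, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&join_b, cudaEventDisableTiming);

    // Fork: private streams wait on all prior work in user_stream.
    cudaEventRecord(fork, user_stream);
    cudaStreamWaitEvent(priv_a, fork, 0);
    cudaStreamWaitEvent(priv_b, fork, 0);

    // ... enqueue the independent GEMMs/kernels on priv_a and priv_b here ...

    // Join: user_stream waits until both private streams are done.
    cudaEventRecord(join_a, priv_a);
    cudaEventRecord(join_b, priv_b);
    cudaStreamWaitEvent(user_stream, join_a, 0);
    cudaStreamWaitEvent(user_stream, join_b, 0);

    cudaEventDestroy(fork);
    cudaEventDestroy(join_a);
    cudaEventDestroy(join_b);
}
```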

Very nice post. Thanks for sharing the optimizations.

Is it possible to share the dimensions of each matrix in a single SGEMM call? I am assuming that out of the 8 SGEMM calls there are 2 groups of 4 SGEMMs, and all SGEMMs in a group have the same input matrix dimensions. I am trying to understand the calculations and how large the matrices are. Thanks.

Typically each of the eight matrices is square, with side length equal to the hidden state size of the model. In the first layer the non-recurrent input may have a different dimension, in which case rectangular matrices are needed for those four operations. The hidden state size and input size are user defined and typically lie in the 256–2048 range, though larger and smaller values are also sometimes used.
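
To make those shapes concrete, here's a small worked example with assumed sizes (H = 512, I = 512, N = 64 are illustrative choices, not values taken from the post):

```cpp
#include <cstdio>

int main() {
    const int H = 512;  // hidden state size (assumed for illustration)
    const int I = 512;  // input size; equals H except possibly in layer 1 (assumed)
    const int N = 64;   // minibatch size (assumed)

    // Each of the 4 recurrent weight matrices is H x H; each of the 4 input
    // weight matrices is H x I. Multiplying by the H x N hidden-state
    // minibatch (or the I x N input minibatch) gives H x N gate pre-activations.
    std::printf("recurrent GEMM per gate: (%d x %d) * (%d x %d) -> (%d x %d)\n",
                H, H, H, N, H, N);
    std::printf("input     GEMM per gate: (%d x %d) * (%d x %d) -> (%d x %d)\n",
                H, I, I, N, H, N);

    // Grouping the four recurrent GEMMs into one call, as the post describes,
    // gives a single larger GEMM:
    std::printf("grouped recurrent GEMM : (%d x %d) * (%d x %d) -> (%d x %d)\n",
                4 * H, H, H, N, 4 * H, N);
    return 0;
}
```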

Thanks. That makes sense to a large extent. I will look closely into more calculations to understand it completely.

Also, is there any framework that takes advantage of these optimizations? I think the DNN frameworks need to be updated so that they can benefit from them. Please let me know if a framework with this implementation is available.

It's ok, the article is about optimisation, not about the new tensor descriptors.

Most major frameworks have ways to interact with the cuDNN RNN API (e.g. TensorFlow, PyTorch, Caffe2, and more).

Thanks for the great tutorial.

I am trying to understand better what "fusing element-wise operation" means. Please let me know if the following is correct:

Suppose I want to compute ReLU(A*B). Without fusing the pointwise operation, I launch a GEMM kernel to compute A*B, and when the kernel finishes it writes the product C = A*B back to global memory. Then I launch a second kernel to compute ReLU(C), which has to fetch C from global memory into on-chip memory and threshold all of its entries. Obviously, in this last step almost all the time is spent fetching C from global memory, and the goal of fusing is to eliminate this unnecessary memory traffic.

In the "fusing" scenario, we launch a single GEMM kernel with an extra line at the end of the kernel code that thresholds each entry of C as it becomes available.

Did I understand correctly what "fusing element-wise operation" means?

Thanks

> Did I understand correctly what "fusing element-wise operation" means?

Yes. In this case the optimizations are not fusing operations with the GEMM kernel, but instead implementing all of the tanh/sigmoid/addition/multiplication operations as a single kernel rather than as individual kernels.
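
As an illustration of what that single kernel might look like, here's a minimal sketch of a fused LSTM point-wise stage in CUDA. It's my own simplified version rather than cuDNN's implementation, and it assumes the gate pre-activations already contain the summed GEMM outputs and biases:

```cpp
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float sigmoidf(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// All sigmoid/tanh/multiply/add operations for one LSTM cell update happen in
// a single kernel: each pre-activation is read from global memory once, the
// intermediate values stay in registers, and only c and h are written back.
__global__ void lstm_pointwise_fused(const float* i_pre, const float* f_pre,
                                     const float* g_pre, const float* o_pre,
                                     const float* c_old,
                                     float* c_new, float* h_new, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float i = sigmoidf(i_pre[idx]);   // input gate
    float f = sigmoidf(f_pre[idx]);   // forget gate
    float g = tanhf(g_pre[idx]);      // candidate cell update
    float o = sigmoidf(o_pre[idx]);   // output gate

    float c = f * c_old[idx] + i * g; // new cell state
    c_new[idx] = c;
    h_new[idx] = o * tanhf(c);        // new hidden state
}
```

Each element makes one round trip through global memory for the whole point-wise stage rather than one per operation, which is where the savings come from.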

Fusing these operations with GEMM kernels isn't out of the question. In this recent post on CUTLASS: https://devblogs.nvidia.com... there is an example showing how a GEMM can be fused with a bias and ReLU activation function.

Thank you for this excellent blog. I have a question about Optimization 5. I think that in every sub-iteration, when we finish the computation in stream B, we need to perform the point-wise computation, because the next iteration requires the outputs from the last iteration. I don't know if my idea is right, so any answer would be welcome. Thank you.