Use of cuDNN RNN

Hello,

I will first summarize what I think I understand about the cuDNN 5.1 RNN functions:

x = [seq_length, batch_size, vocab_size] # input
y = [seq_length, batch_size, hiddenSize] # output

dx = [seq_length, batch_size, vocab_size] # input gradient
dy = [seq_length, batch_size, hiddenSize] # output gradient

hx = [num_layer, batch_size, hiddenSize] # input hidden state
hy = [num_layer, batch_size, hiddenSize] # output hidden state
cx = [num_layer, batch_size, hiddenSize] # input cell state
cy = [num_layer, batch_size, hiddenSize] # output cell state

dhx = [num_layer, batch_size, hiddenSize] # input hidden state gradient
dhy = [num_layer, batch_size, hiddenSize] # output hidden state gradient
dcx = [num_layer, batch_size, hiddenSize] # input cell state gradient
dcy = [num_layer, batch_size, hiddenSize] # output cell state gradient

w = [param size] # parameters (weights & biases)
dw = [param size] # parameter gradients
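To make the buffer sizes concrete, here is a small sketch that computes the logical shapes listed above. `rnn_buffer_shapes` and `element_count` are hypothetical helper names, not cuDNN functions. The `bidirectional` factor is an assumption based on my reading of the cuDNN documentation (the last dimension of y and the first dimension of the state tensors double); the summary above only covers the unidirectional case.

```python
from math import prod

def rnn_buffer_shapes(seq_length, batch_size, input_size, hidden_size,
                      num_layers, bidirectional=False):
    """Return the logical shapes of the RNN buffers (hypothetical helper)."""
    dirs = 2 if bidirectional else 1  # assumption: both directions double these dims
    shapes = {
        "x": (seq_length, batch_size, input_size),
        "y": (seq_length, batch_size, hidden_size * dirs),
    }
    state = (num_layers * dirs, batch_size, hidden_size)
    for name in ("hx", "hy", "cx", "cy"):
        shapes[name] = state
    # Each gradient buffer has the same shape as the tensor it belongs to.
    for name in list(shapes):
        shapes["d" + name] = shapes[name]
    return shapes

def element_count(shape):
    """Number of elements to allocate for a buffer of this shape."""
    return prod(shape)

if __name__ == "__main__":
    # Values from the toy setup in this thread.
    toy = rnn_buffer_shapes(seq_length=3, batch_size=1, input_size=255,
                            hidden_size=20, num_layers=2)
    for name, shape in toy.items():
        print(name, shape, element_count(shape))
```

Multiply the element count by the size of the data type (for example 4 bytes for float) to get the allocation size in bytes.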

cudnnRNNForwardTraining/cudnnRNNForwardInference

input: x, hx, cx, w
output: y, hy, cy

cudnnRNNBackwardData

input: y, dy, dhy, dcy, w, hx, cx
output: dx, dhx, dcx

cudnnRNNBackwardWeights

input: x, hx, y, dw
output: dw

Questions:

  1. Is the following training workflow for multi-layer RNN (num_layer > 1) correct?
     1. init hx, cx, dhy, dcy to NULL
     2. init w (weights: small random values, bias: 1)
     3. forward
     4. backward data
     5. backward weights
     6. update weights: w -= learning_rate * dw
     7. dw = 0
     8. goto 3
  2. Do you confirm cuDNN already implements stacked rnn when num_layer > 1? (no need to call num_layer times forward/backward methods)
  3. Should I re-inject hidden state & cell state into the network at next batch?
  4. The output in lstm formulas is hy. Should I use hy as output or y?

I am experimenting on a toy data set (x = a sentence repeated a few times, trying to predict the next letter). So far the loss never converges.

network: input->lstm->fully connected->softmax
batchSize = 1
sequenceLength = 3
hiddenSize = 20
numLayers = 2
vocabSize/inputSize = 255

  1. Is the following training workflow for multi-layer RNN (num_layer > 1) correct?

It is one workflow you could use. Initial conditions for the weights and biases are up to you. A lot of people initialize the forget gate differently from the other weights, for example. Similarly, more complex weight update schemes are usually used in practice.
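As an illustration of the two points above, here is a minimal sketch of an initialization that treats the forget gate bias differently, followed by a plain SGD step. The parameter layout (a dict keyed by gate name), the gate names, and the function names are all hypothetical and do not reflect how cuDNN packs w. Zeroing dw after the update assumes, as I understand the documentation, that cudnnRNNBackwardWeights accumulates into dw rather than overwriting it.

```python
import random

# Illustrative gate names, not cuDNN identifiers.
GATES = ("input", "forget", "cell", "output")

def init_lstm_params(hidden_size, input_size, forget_bias=1.0, scale=0.1, seed=0):
    """Small random weights; only the forget gate bias starts at forget_bias."""
    rng = random.Random(seed)
    params = {}
    for gate in GATES:
        n_weights = hidden_size * (input_size + hidden_size)
        params[gate + "_w"] = [rng.uniform(-scale, scale) for _ in range(n_weights)]
        bias = forget_bias if gate == "forget" else 0.0
        params[gate + "_b"] = [bias] * hidden_size
    return params

def sgd_step(params, grads, learning_rate):
    """Descend along the gradient, then clear it for the next batch."""
    for name, values in params.items():
        g = grads[name]
        for i in range(len(values)):
            values[i] -= learning_rate * g[i]  # minus sign: gradient descent
            g[i] = 0.0                         # dw = 0 before the next backward pass

if __name__ == "__main__":
    params = init_lstm_params(hidden_size=2, input_size=3)
    grads = {name: [1.0] * len(v) for name, v in params.items()}
    sgd_step(params, grads, learning_rate=0.5)
    print(params["forget_b"], grads["forget_b"])
```

Optimizers such as Adam or Adagrad replace only the body of `sgd_step`; the forward, backward data, backward weights sequence stays the same.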

  2. Do you confirm cuDNN already implements stacked rnn when num_layer > 1? (no need to call num_layer times forward/backward methods)

Yes

  3. Should I re-inject hidden state & cell state into the network at next batch?

That’s up to you. In some cases it may be beneficial to do so; in others the states should be initialized to zero (or left NULL). This problem isn’t really specific to the cuDNN implementation.

  4. The output in lstm formulas is hy. Should I use hy as output or y?

hy is the final hidden state (one per layer). y holds the hidden state of the last layer at each timestep. Depending on how you want your network to behave you may want to use one or both.

Thank you for your answers! I realize I have mixed cuDNN questions with RNN questions not related to cuDNN.

I’ve made some simplifications in this pseudocode, but I understand there are various weight initialization schemes and weight update policies (e.g. Adam, Adagrad, etc.).

Am I right to say that:

  • By re-injecting previous states into the network I will be able to capture time dependencies that can be longer than the sequence length (seq_length).
  • If I set those states to zero/NULL, the network will only be able to capture time dependencies within the timesteps provided to the network.

Yes. People often use re-injecting for “truncated back propagation through time”. It can also be useful in some sequence to sequence models where the encoder and decoder may have different network structures and you want communication between them.
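A toy sketch of what re-injecting buys you, assuming a scalar tanh recurrence as a stand-in for the real network (`run_chunk` and `run_in_chunks` are hypothetical names, and the weights are arbitrary). It covers only the forward pass: with truncated back propagation through time, the hidden state is carried across chunks while the gradient still stops at each chunk boundary.

```python
import math

def run_chunk(xs, h0=0.0, w_h=0.5, w_x=1.0):
    """Process one chunk of timesteps; return per-step outputs and the final state."""
    h = h0
    ys = []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        ys.append(h)
    return ys, h

def run_in_chunks(xs, seq_length, carry_state):
    """Feed a long sequence in chunks of seq_length timesteps."""
    h = 0.0
    outputs = []
    for start in range(0, len(xs), seq_length):
        if not carry_state:
            h = 0.0  # same effect as passing zero/NULL initial state for every chunk
        ys, h = run_chunk(xs[start:start + seq_length], h)  # h plays the role of hy -> hx
        outputs.extend(ys)
    return outputs

if __name__ == "__main__":
    signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # an impulse at t = 0
    full, _ = run_chunk(signal)
    carried = run_in_chunks(signal, seq_length=3, carry_state=True)
    reset = run_in_chunks(signal, seq_length=3, carry_state=False)
    print(full[3], carried[3], reset[3])
```

With the state carried, the second chunk still sees the impulse from t = 0; with the state reset, the output at t = 3 is exactly zero because nothing from the first chunk survives.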

In some cases you can’t call more than one timestep at once. For example: some generative models take the output of one timestep and feed it directly into the next. In this case you’d have to call cuDNN one timestep at a time.
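The generative case can be sketched as a loop that makes one call per timestep. `rnn_step` below is a hypothetical stand-in for a single forward call with a sequence length of 1 (its arithmetic is arbitrary and only there to make the loop runnable); the point is that each output has to exist before the next input can be built.

```python
def rnn_step(token, state, vocab_size=5):
    """Hypothetical stand-in for one forward call with seq_length = 1."""
    new_state = (state * 3 + token + 1) % 7        # plays the role of hy
    next_token = (new_state + token) % vocab_size  # plays the role of the sampled output
    return next_token, new_state

def generate(first_token, num_steps):
    """Feed each output back in as the next input, one timestep per call."""
    token, state = first_token, 0
    tokens = [token]
    for _ in range(num_steps):
        token, state = rnn_step(token, state)  # state is re-injected as hx
        tokens.append(token)
    return tokens

if __name__ == "__main__":
    print(generate(first_token=1, num_steps=4))
```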

Got it. Thank you!

Hi All,

I’m implementing an RNN layer (in Caffe) using the cuDNN API. I got correct results on an OCR task with layer_num=1, but cannot get correct results when layer_num>1.

I was wondering whether I am using the API incorrectly, and I am still confused about two questions.

  1. When layer_num>1, is the output ‘y’ shaped like [seq, batch, state] or [seq * num_layer, batch, state]?
  2. I found that dropout is not working, any tips?

Thanks a lot

I’m using the bare cuDNN library in C++, without any framework.

A1: If “layer_num” is hiddenSize, then y size is seqLength * hiddenSize * miniBatch * (bidirectional ? 2 : 1)

A2: cuDNN Documentation v7.0, chapter 4.97, cudnnSetRNNDescriptor says: “Dropout will be applied between layers; a single layer network will have no dropout applied.”

Hi,

For A1: “layer_num” means how many layers are stacked, so it is not related to hiddenSize. I’m still unsure whether, with layer_num=2, the first RNN layer also writes its output to “y”.
For A2: I missed this sentence while reading the docs, thanks.

Is your bare cuDNN code open-source? If so, could you please share the link?
Thanks.

Hi, charlie_huang!

OK. The number of stacked layers in RNN does not affect its output size at all. These are absolutely not related.

Q: I’m still confused that if layer_num=2, whether the first RNN layer will output to “y”?

A: If layer_num=2, the first stacked layer sends its output (y of L0) to the input of the second stacked layer (x of L1). The output of the second stacked layer (y of L1) is the final result of the two-layer RNN. Note: all stacked layers have the same output size as the RNN, and, of course, the input size of each stacked layer equals the output size of the previous one.
In any case, treat the stacked-layer RNN as a “sealed pack”, unless you are trying to add peepholes to cuDNN’s LSTM :-)
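The stacking described above can be checked with a toy model. This is a sketch of a unidirectional stacked tanh RNN with scalar states and arbitrary weights, not cuDNN's implementation, and `stacked_rnn_forward` is a hypothetical name. It shows that y keeps one entry per timestep regardless of the number of layers, that hy has one entry per layer, and that the last entry of hy equals the last entry of y.

```python
import math

def stacked_rnn_forward(xs, num_layers, w_x=0.8, w_h=0.5):
    """Toy stacked tanh RNN (num_layers >= 1); returns (y, hy)."""
    layer_input = list(xs)
    hy = []
    for _ in range(num_layers):
        h = 0.0  # initial state of this layer (hx left at zero)
        layer_output = []
        for x in layer_input:
            h = math.tanh(w_x * x + w_h * h)
            layer_output.append(h)
        hy.append(h)                # final hidden state of this layer
        layer_input = layer_output  # y of layer k becomes x of layer k+1
    y = layer_input                 # only the top layer's outputs are exposed as y
    return y, hy

if __name__ == "__main__":
    xs = [1.0, -0.5, 0.25]
    for n in (1, 2, 3):
        y, hy = stacked_rnn_forward(xs, n)
        print(n, len(y), len(hy), hy[-1] == y[-1])
```

The intermediate layer outputs never appear in y, which is why the size of y does not depend on layer_num.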

I’m using cuDNN API as a GPU-accelerated library of primitives for deep neural networks in C++. No Frameworks.

Hi xnode,

I finally got the code working, and your answers helped a lot in the debugging process.
Thank you very much! :)

charlie

You’re welcome, charlie_huang!

Hi all,

After experimenting with one three-layer LSTM versus three separate single-layer LSTMs stacked together, I found that the first approach is 2x faster than the second. However, the limitation of the first approach is that every layer must have the same size. Is there any way to specify each layer’s hidden size so that they can differ?

Thanks in advance!

Hi Yujia,

It may help to know why cuDNN shows such a large speedup when layer_num>=2.

Please check the following link.
https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/

Multiple layers of a stacked RNN/LSTM are parallelized only if they share the same hidden state dimension.
https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/image06.png

Therefore I think you may not be able to implement what you want with the cuDNN RNN API. If you find a different approach, please let me know.

Wishes,
charlie

Hi Charlie,

Thank you very much for your answer! I have checked the link you gave and now I understand the optimization mechanism here.

However, when I train my cudnn_lstm in TensorFlow, which should use the cuDNN library under the hood, stacking layers with different hidden sizes (1024, 1024, 512) is somehow faster than stacking three 1024 layers. Maybe TensorFlow’s cudnn_lstm only uses separate cuDNN LSTMs. Hmm… any thoughts on this?

Hi Yujia,

Sorry, I have no idea about this, since I’m using Caffe instead of TensorFlow. I guess you are right about how TensorFlow uses cuDNN, but I don’t have any proof. BTW, it seems that cuDNN is not well supported in TensorFlow.

Bests,
Charlie

Hi Charlie,

I’m very sorry, but I measured the performance incorrectly in my C++ code. The first approach only slightly outperforms the second one. My setting is batch_size=128, num_step=8, embed_size=1024, layer_size=1024 for all three layers. The first takes about 17 ms per batch and the second about 16 ms per batch. I don’t know whether I implemented it incorrectly or cuDNN just can’t optimize much in this setting. Anyway, thank you for your time!
