Introduction to Neural Machine Translation with GPUs (part 3)

In the vanilla form of backpropagation through time (without truncation), one update of the weights requires reading a full sequence. If the sequence is really long (tens or hundreds of thousands of steps), this becomes impractical.
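As a rough illustration of the usual workaround, here is a minimal sketch of truncated BPTT. It is not from the post; it assumes a PyTorch-style RNN and made-up dimensions. The long sequence is split into chunks, and the hidden state is detached between chunks, so each weight update only backpropagates through one chunk instead of the whole sequence.

```python
# Hypothetical sketch of truncated BPTT (PyTorch); names and sizes are illustrative.
import torch
import torch.nn as nn

seq_len, chunk, batch, dim = 10000, 100, 1, 32
rnn = nn.RNN(input_size=dim, hidden_size=dim)
head = nn.Linear(dim, dim)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(seq_len, batch, dim)   # one very long input sequence
y = torch.randn(seq_len, batch, dim)   # matching targets
h = torch.zeros(1, batch, dim)

for t in range(0, seq_len, chunk):
    h = h.detach()                      # truncate: no gradient flows past this point
    out, h = rnn(x[t:t + chunk], h)
    loss = ((head(out) - y[t:t + chunk]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()                          # one update per chunk, not per full sequence
```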

I suggest you read Sec. 4.1.4 of my lecture notes: http://arxiv.org/abs/1511.0...

Please look at https://github.com/kyunghyu..., which was written by Orhan Firat.

Unfortunately, RNNs are known to adapt to the sequence lengths they have seen during training, which is perhaps why this happens. For more interesting approaches to building an RNN that generalizes to sequences much longer than those seen during training, see for instance http://papers.nips.cc/paper....

It's simply concatenating the input, the previous hidden state and the context created by taking the weighted sum of the source annotation vectors (with the glimpses as the weights).
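For concreteness, here is a tiny NumPy sketch of that concatenation. The names (y_prev, s_prev, H, alpha) and dimensions are illustrative, not from the post.

```python
# Minimal sketch of the decoder input described above.
# H: source annotation vectors, alpha: glimpses (attention weights),
# y_prev: previously emitted target embedding, s_prev: previous decoder state.
import numpy as np

T_src, dim_h, dim_y = 7, 4, 3
H = np.random.randn(T_src, 2 * dim_h)      # bidirectional annotations h_1 .. h_T
alpha = np.random.rand(T_src)
alpha /= alpha.sum()                        # attention weights sum to one

c = (alpha[:, None] * H).sum(axis=0)        # context: weighted sum of annotations
y_prev = np.random.randn(dim_y)             # embedding of the previous target word
s_prev = np.random.randn(dim_h)             # previous decoder hidden state

decoder_input = np.concatenate([y_prev, s_prev, c])   # fed to the decoder RNN
print(decoder_input.shape)                  # (dim_y + dim_h + 2*dim_h,)
```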

Many thanks for your reply!

Many thanks for your nice post and illustrations. I have one question regarding the possible dependencies in the attention mechanism.

Have you ever tried other dependencies? For example, in your model s(i) depends on c(i), and e(ij) is based only on s(i-1), not on the last two states. I am really interested to know whether you tried other dependencies and reached the conclusion that this one works best.

Is there any special meaning behind these dependencies?

Many thanks and sorry if the question is not relevant.

Has anyone implemented this, or is anyone willing to implement it, in deeplearning4j?

I am looking for something to use as a starting point to try to translate Cherokee/English using a small corpus (only a small corpus is available).

The combination of extreme language dissimilarity and a small corpus does not work well with SMT.

I see that y'all have authored a paper on using non-aligned corpus materials.

Why not train a neural network to align paragraphs between two translations of the same document? (This would also indicate which paragraphs to treat as non-alignable.)

Not that I know of, but it'll be a great service to the community if someone implements it in deeplearning4j.

That is certainly an interesting problem, but it may not necessarily require a neural net. There has already been quite a lot of work on automatically aligning source and target sentences:

https://scholar.google.com/...

I haven't tried it before, but it is certainly an active research topic. Some recent papers that I can think of right away include (but are definitely not limited to):

http://arxiv.org/abs/1608.0...
http://arxiv.org/abs/1607.0...
http://arxiv.org/abs/1605.0...
http://papers.nips.cc/paper...

Many more such works have been presented at ACL, NAACL and EMNLP this year.

Where would be a good place to ask for someone to do this, assuming deeplearning4j is capable?

Hi, Adam from deeplearning4j here. We haven't seen much machine translation work ourselves; we have a few NLP shops that write papers with us, but none are doing translation. We have seq2seq built in, but not much beyond that. The new custom layer support should make this easier as well.

So, where does one go to ask a third party to do this kind of thing?

Can anyone explain to me the advantage of multiple hidden layers in the encoder over a single layer? Understanding the advantage of multiple layers in a CNN is intuitive, but for an LSTM it's not very obvious to me.
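Not an answer from the author, but to make the question concrete, here is a hypothetical PyTorch sketch of a two-layer (stacked) LSTM encoder. Each additional layer reads the full sequence of hidden states produced by the layer below, which is what lets upper layers form more abstract representations of the source sentence, loosely analogous to deeper CNN layers.

```python
# Illustrative sketch of a stacked LSTM encoder; all names and sizes are made up.
import torch
import torch.nn as nn

vocab, emb_dim, hid_dim, T, batch = 1000, 64, 128, 15, 2
embed = nn.Embedding(vocab, emb_dim)
encoder = nn.LSTM(input_size=emb_dim, hidden_size=hid_dim, num_layers=2)

src = torch.randint(0, vocab, (T, batch))   # a batch of source sentences (word ids)
states, (h_n, c_n) = encoder(embed(src))    # states: top-layer outputs, (T, batch, hid_dim)
print(states.shape, h_n.shape)              # h_n: final state of each layer, (2, batch, hid_dim)
```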

Thanks for your clear explanation; it helped me improve my understanding of NMT.