Introduction to Neural Machine Translation with GPUs (part 3)

Originally published at: https://developer.nvidia.com/blog/introduction-neural-machine-translation-gpus-part-3/

Note: This is the final part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 2. In the previous post in this series, I introduced a simple encoder-decoder model for machine translation. This simple encoder-decoder model is excellent at English-French translation. However, in this…

Thank you! Very good post!!

"Furthermore, we can consider this set of context-dependent word representations as a mechanism by which we store the source sentence as a variable-length representation, as opposed to the fixed-length, fixed-dimensional summary from the simple encoder-decoder model."

1. Does that mean the representation length of the source sentence is determined by the number of annotation vectors h and the dimensionality of h?

2. In addition, I think the expected vector c_i can be seen as the final representation of the source sentence, whose length is fixed (the dimension of h). Is that right?

Hi Minglei,

1. Yes and no. I meant that the length of the representation is proportional to the length of the source sentence, which is proportional to the number of annotation vectors.

2. Only at time i. The representation c_i changes w.r.t. i according to the attention weights, which are determined by the decoder's hidden state (which again changes for each and every target word generated.)
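To make the dependence on i concrete, here is a rough numpy sketch (the shapes and parameter names are only illustrative, not taken from the actual code) of how c_i is recomputed at every decoding step from the same, fixed annotation vectors:

```python
import numpy as np

def context_vector(h, z_prev, W_a, U_a, v_a):
    """Recompute the expected context c_i for one decoding step i.

    h      : (T_x, d_h) annotation vectors of the source sentence (fixed during decoding)
    z_prev : (d_z,)     decoder hidden state from the previous step (changes with i)
    W_a, U_a, v_a :     parameters of the small scoring network (illustrative names)
    """
    # score each source annotation against the current decoder state
    scores = np.tanh(z_prev @ W_a + h @ U_a) @ v_a   # (T_x,)
    # attention weights alpha_{i,j}: a softmax over source positions
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # expected (context) vector c_i: a weighted sum of the annotations
    return alpha @ h                                 # (d_h,)
```

The annotations h stay the same; only alpha, and therefore c_i, changes as the decoder's hidden state moves from one target word to the next.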

Cheers
- K

Hey, thanks for the great post! NMT is one of the coolest recent applications of NNets out there.

I have two questions, however:
- Can your code be used to train an actual working NMT model, or is it just a toy example for educational purposes?
- In general, can a good NMT model be trained on a GTX 980 with 4 GiB of RAM, or do these models need so many parameters that they require a cluster of Titan Xs?

Edit:
I've got one more question.
When we use the decoder to produce a translation, it is obvious that we feed the output back into the input (like in a character-level RNN language model).
During training, however, what is the input to the decoder? Also its own output, or rather the ground-truth translation? When training a char-RNN we input the true letters, not the sampled ones.

Hi Marcin,

- "Can Your code be used to train an actual working NMT model, or it's just a toy example for education purposes?"

Yes, it can. At the moment it is missing a post-processing routine for replacing unknown tokens and the very-large-target-vocabulary extension, but even without them you can get a decent NMT model with the very code in those repos.

- "In general, can a good NMT model be trained on a GTX (980) with 4Gibs of RAM, or these models need so many parameters they require a cluster of Titan Xs?"

Yes, you can, but each model will need to be quite small and won't perform as well on its own. Instead, you can train multiple small models (each on a GTX 980) and make an ensemble of them.
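A simple way to combine them at test time is to average the models' next-word distributions at every decoding step; this is only a sketch, and `next_word_probs` is a hypothetical interface rather than the one in the repos:

```python
import numpy as np

def ensemble_next_word_probs(models, source, prefix):
    """Average the next-word distributions of several independently trained models.

    models : trained NMT models, each with a (hypothetical) method
             next_word_probs(source, prefix) -> length-V probability vector
    """
    probs = np.mean([m.next_word_probs(source, prefix) for m in models], axis=0)
    return probs / probs.sum()   # renormalize against numerical drift
```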

- "During training however, what is the input to the decoder?"

It's the ground truth you feed in, because that is what maximizing the log-likelihood dictates (check out the first post.) However, this does not mean that it is the only or the best way. See, for instance, http://arxiv.org/abs/1506.0...
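To make the difference explicit, here is a small sketch (with a hypothetical one-step decoder function, not the code from the repos) of teacher forcing during training versus feeding back the model's own predictions at translation time:

```python
def decode(decoder_step, z0, c, target=None, bos=0, eos=1, max_len=50):
    """decoder_step(z, y_prev, c) -> (z_next, probs) is a hypothetical one-step decoder.

    If `target` is given, the ground-truth words are fed back (training).
    Otherwise the decoder's own predictions are fed back (translation).
    """
    z, y_prev, outputs = z0, bos, []
    steps = len(target) if target is not None else max_len
    for t in range(steps):
        z, probs = decoder_step(z, y_prev, c)
        y_hat = int(probs.argmax())
        outputs.append(y_hat)
        # training: feed in the ground truth; translation: feed in the prediction
        y_prev = target[t] if target is not None else y_hat
        if target is None and y_hat == eos:
            break
    return outputs
```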

Cheers,
- K

Thank you so much for the reply and the link.

It's too bad my new high-end GPU is still not good enough for deep learning, but I hope it'll suffice for educational purposes.

Thanks again,
Regards,
Marcin

Very nice post!
I have two questions.
1. NMT has reached very good performance compared with SMT in only a few years. But we know that linguistic knowledge plays a key role in NLP and SMT. What do you think about the benefit of integrating linguistic knowledge into NMT?
2. My other question is whether NMT can avoid being limited by expensive parallel corpora and instead fully utilize the many comparable corpora available. After all, NMT models the source sentence as an abstract representation and doesn't explicitly model word- or phrase-level relationships between languages the way SMT does.

Hi Xiang,

1. "How do you think about gaining profit by integrating the linguistic knowledge into NMT?"

I believe that the NMT models described here already capture the linguistic knowledge necessary for translating well. However, I also believe that existing linguistic knowledge can be used as a guiding signal during training. For instance, it may speed up the convergence of training to augment the encoder with additional classifiers or structured output predictors that predict certain linguistic properties of a source sentence deemed important. This kind of handing out hints to make learning easier has been tried in, e.g., http://arxiv.org/abs/1301.4083. So, not as input, but as an additional target.
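As a rough sketch (the POS-tagging head and the weighting are purely illustrative, not something from the posts' code), the linguistic hint simply becomes an extra term in the training objective:

```python
def joint_loss(translation_nll, encoder_states, pos_tags, aux_classifier, weight=0.1):
    """Combine the usual NMT objective with an auxiliary linguistic prediction.

    translation_nll : negative log-likelihood of the target sentence (the usual loss)
    encoder_states  : annotation vectors h_1..h_Tx of the source sentence
    pos_tags        : e.g. part-of-speech tags of the source words (the extra target)
    aux_classifier  : a small classifier with nll(h_j, tag_j), predicting a tag per word
    """
    aux_nll = sum(aux_classifier.nll(h, tag) for h, tag in zip(encoder_states, pos_tags))
    # the linguistic knowledge enters as an additional target, not as an input
    return translation_nll + weight * aux_nll
```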

2. "Another question is that whether NMT can not limited by the expensive parallel corpus, but fully utilizing many comparable corpus?"

Yes, I agree with you that it's important to utilize the (almost infinite) monolingual corpora. However, it is not clear how this should be done in NMT. In fact, we here at Montreal, jointly with our collaborators in France, proposed one such way, called 'deep fusion', in http://arxiv.org/abs/1503.0.... Though, I should tell you: this paper has already been rejected twice, from ACL'15 and EMNLP'15.. :(

Cheers,
- K

Do you think that in the near future it will be possible to utilize "somewhat parallel" corpora, like translated books? Perhaps by forcing two language models to produce similar paragraph vectors, or something like that?

Well, this I have no answer to. Of course, I hope one day there will be no need for strict sentence/paragraph alignment, but it's quite unclear how it'll happen.

Hi Kyunghyun, thanks for sharing the code. I have some questions about these lines: https://github.com/kyunghyu... . What is `n_words_source`, and what are the list comprehension and lines 48-50 doing? Could you briefly explain?

Also, what does `source` at https://github.com/kyunghyu... refer to? Is that the list of frequencies of the words?

Once again, thanks for sharing the code. It's great to learn from, and I've learnt a lot and am still learning more from it! I wish everyone would release their code for every paper/tutorial, no matter how "raw" it is. It's valuable documentation and a great learning resource!

Hi Liling,

1. "What is `n_words_source`"

That is the maximum size of the source vocabulary. Any word with an index larger than this max size will be considered an unknown word (too rare.)

2. "what does `source` at https://github.com/kyunghyunch... refer to?"

That's a list of sentences in a minibatch.
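Putting the two answers together, the preprocessing does roughly the following (a simplified sketch, not the exact code in the repo):

```python
def prepare_batch(source, word2index, n_words_source, unk_index=1):
    """Turn a minibatch of source sentences into sequences of word indices.

    source         : list of sentences in the minibatch (each a list of word strings)
    n_words_source : maximum source vocabulary size; rarer words become unknowns
    """
    batch = []
    for sentence in source:
        indices = [word2index.get(w, unk_index) for w in sentence]
        # any index beyond the vocabulary cut-off is mapped to the unknown token
        batch.append([i if i < n_words_source else unk_index for i in indices])
    return batch
```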

Cheers,
- K

Thanks for the great post!
In Figure 7, why does RNNsearch-30 still suffer from performance degradation with respect to sentence length?

Hi Kyunghyun, thanks for the specific explanations of the code!!

Thanks for the posts. I am trying to replicate your sessions from the git repo, but the datasets you use seem to be missing from it. Do you know where I can get these sets?

Thanks

Thanks for the great post and the excellent code.

I am trying to apply the attention model to other NLP tasks.

What should I do if I want to specify or constrain the length of the target sequence in prediction (say, to 1, or to the same length as the input sequence)?

Hello! Thanks for the great introduction!

I have a question:
What is the exact form of the mixing between the "glimpses" generated by the attention network and the recurrent (e.g., GRU) decoder network?

Thanks for the very nice post!
I am very interested in your opinion on going beyond modeling sentences. You said, "Learning should probably be local, and weights should be updated online while processing a sequence."
I have a question: what does "weights should be updated online while processing a sequence" mean? Could you explain it a bit more?

Hi,

Thanks so much for your very interesting illustrations. I hope you can answer my questions below.

1) In the "encoder" phase of the network, what are the targets used during training to bring up the fixed sentence representation? Do you use for example a fixed special word as a target to indicate "null" output at this phase, or you also train to produce the next word in sequence like in language modeling ? or something else? I am really very interested to know.

2) For the simpler model without the attention mechanism, or even for the more complicated ones with the attention mechanism, are the encoder and decoder trained at the same time in one shot, or are they two separate phases? I assume they are trained in one shot to optimize the log-likelihood criterion on the probability of the target sequence given the source sequence. Is that right?

3) When using the encoder-decoder network in translation mode (i.e., not in training mode) to produce a target translation of a given source sentence: we start by reading the source sentence sequentially until we produce its representation, and then we enter the decoder phase. Here the input we need to feed in is a sample of the output, right? Do you select the most probable word from the output layer of the decoder, or do you keep the N best words in a beam-search strategy? Please explain, and what is the stopping condition during this search?

4) Can you please give an idea of how much accuracy you got using your approaches compared to the best SMT and the best earlier NMT methods?

Thank you so so much for your great contribution!!
All the best,

Amr Mousa

"In the "encoder" phase of the network, what are the targets used during training to bring up the fixed sentence representation?"

There is no target in the encoder network.

"For the simpler model without attention mechanism, or even for the more complicated ones with the attention mechanism, is the encoder and decoder trained at the same time in one shot?"

They are trained jointly without any pretraining. The separation between the encoder and decoder is rather conceptual.

"3)"

During training, you feed in the correct words from the training data set (as dictated by maximum likelihood estimation.) At test time, you feed in the word selected at the previous time step. In order to do better (approximate) decoding, it is usual to use beam search.
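A minimal beam-search sketch (with a hypothetical `step_log_probs` function; the real implementations also carry per-hypothesis decoder states and normalize for length):

```python
def beam_search(step_log_probs, beam_width=5, bos=0, eos=1, max_len=50):
    """step_log_probs(prefix) -> {word: log_prob} is a hypothetical function
    giving the decoder's next-word log-probabilities for a target prefix."""
    beams = [([bos], 0.0)]          # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in step_log_probs(prefix).items():
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # a hypothesis ending in the end-of-sentence symbol is complete
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:               # stop once every kept hypothesis has ended
            break
    return max(finished or beams, key=lambda c: c[1])[0]
```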

"can you please give an idea how much accuracy you got using your approaches compared to the best SMT and the best earlier NMT method?"

I suggest you look at Table 1 of http://arxiv.org/abs/1507.0...