Introduction to Neural Machine Translation with GPUs (Part 2)

They are not trained separately but jointly.

It depends on which type of recurrent activation function you use.

Well, as I understand it, decoding occurs both during training and at test time. Could you please clarify what MLE training is?

You do decoding at test time, but not during training if maximum likelihood (MLE) training is used. MLE is explained in this article already.
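To see why no decoding is needed under MLE training, here is a minimal sketch of a teacher-forced training step (all sizes, the random "hidden state", and the variable names are hypothetical stand-ins, not the article's actual model): at each position the decoder is conditioned on the *true* previous word, so we never search for an output sentence; we only accumulate the log-probability of the known next word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                      # tiny vocabulary and hidden size (hypothetical)
W = rng.normal(size=(V, d))      # output weights: one row per target word
target = [2, 0, 3]               # ground-truth target sentence (word indices)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# MLE / teacher-forced training: no search (decoding) is performed --
# the correct previous words are given, so we just score the correct
# next word at each step and sum the negative log-likelihoods.
loss = 0.0
for y_t in target:
    z = rng.normal(size=d)       # stand-in for the decoder hidden state z_t
    p = softmax(W @ z)           # distribution over the target vocabulary
    loss -= np.log(p[y_t])       # NLL of the correct word; no argmax/beam search
```

At test time the true target is unavailable, which is exactly when decoding (greedy or beam search over the vocabulary) becomes necessary.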

I'm completely confused. The encoder and decoder are trained jointly, but how can a decoder be trained without decoding?

Thank you for this informative post, although I don't understand the part about calculating output scores. I think by W you meant the output layer's weights (just under Figure 8).

Hi, I am also a reader of this post.

Mr. Cho referred to W as a 'target word vector', which appears to be the weight parameters of the softmax output layer. Since the number of outputs from the softmax output layer (at each time step, i.e., for each word in the target sentence) must equal the total number of unique target words (i.e., the size of the target vocabulary), we may consider the set of weight parameters feeding a specific output node (here, indexed by k) as that word's target word vector.
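That reading can be sketched in a few lines (sizes and variable names here are illustrative, not from the post): the softmax weight matrix W has one row w_k per vocabulary word, the score of word k is the dot product of w_k with the decoder's internal state, and the softmax turns those V scores into V probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 3                      # hypothetical vocabulary size and state size
W = rng.normal(size=(V, d))      # softmax weights: row k acts as word k's vector
b = np.zeros(V)                  # output biases, one per vocabulary word
z = rng.normal(size=d)           # decoder internal state at one time step

scores = W @ z + b               # score_k = w_k . z + b_k, for every word k
p = np.exp(scores - scores.max())
p /= p.sum()                     # softmax: one probability per target word
```

So "target word vector" and "output-layer weight row" are the same object under this interpretation: words whose rows point in a similar direction to z receive higher scores.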