Introduction to Neural Machine Translation with GPUs (Part 2)

They are not trained separately but jointly.

It depends on which type of recurrent activation function you use.

Well, as I understand it, decoding occurs both during training and at test time. Could you please clarify what MLE training is?

You do decoding at test time, but not during training if maximum likelihood (MLE) training is used. MLE is explained in this article already.
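To see why no decoding is needed under MLE training, here is a minimal sketch of a teacher-forced training step (all sizes, the random "hidden state", and the variable names are hypothetical stand-ins, not the article's actual model): at each position the decoder is conditioned on the *true* previous word, so we never search for an output sentence; we only accumulate the log-probability of the known next word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                      # tiny vocabulary and hidden size (hypothetical)
W = rng.normal(size=(V, d))      # output weights: one row per target word
target = [2, 0, 3]               # ground-truth target sentence (word indices)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# MLE / teacher-forced training: no search (decoding) is performed --
# the correct previous words are given, so we just score the correct
# next word at each step and sum the negative log-likelihoods.
loss = 0.0
for y_t in target:
    z = rng.normal(size=d)       # stand-in for the decoder hidden state z_t
    p = softmax(W @ z)           # distribution over the target vocabulary
    loss -= np.log(p[y_t])       # NLL of the correct word; no argmax/beam search
```

At test time the true target is unavailable, which is exactly when decoding (greedy or beam search over the vocabulary) becomes necessary.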

I'm completely confused. The encoder and decoder are trained jointly, but how can a decoder be trained without decoding?

Thank you for this informative post, although I don't understand the part about calculating output scores. I think by W you meant the output layer's weights (just under Figure 8).

Hi, I am also a reader of this post.

Mr. Cho referred to W as a 'target word vector', which appears to be the weight parameters of the softmax output layer. Since the number of outputs from the softmax output layer (at each time step, i.e., for each word in the target sentence) must equal the total number of unique target words (i.e., the size of the target vocabulary), we may consider the set of weight parameters feeding a specific output node (here, indexed by k) as that word's target word vector.
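That reading can be sketched in a few lines (sizes and variable names here are illustrative, not from the post): the softmax weight matrix W has one row w_k per vocabulary word, the score of word k is the dot product of w_k with the decoder's internal state, and the softmax turns those V scores into V probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 3                      # hypothetical vocabulary size and state size
W = rng.normal(size=(V, d))      # softmax weights: row k acts as word k's vector
b = np.zeros(V)                  # output biases, one per vocabulary word
z = rng.normal(size=d)           # decoder internal state at one time step

scores = W @ z + b               # score_k = w_k . z + b_k, for every word k
p = np.exp(scores - scores.max())
p /= p.sum()                     # softmax: one probability per target word
```

So "target word vector" and "output-layer weight row" are the same object under this interpretation: words whose rows point in a similar direction to z receive higher scores.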