Introduction to Neural Machine Translation with GPUs (Part 2)

Originally published at:

Note: This is part two of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 3. In my previous post, I introduced statistical machine translation and showed how it can and should be viewed from the perspective of machine learning: as supervised learning where the input and output are…

Have you tried using Neural Turing Machines or memory networks for this purpose? Or do those models require much larger training sets to produce better results?

Hi Kyunghyun,
I was able to successfully train the session1 (session2) model on a GPU (NVIDIA GeForce GTX Titan X). However, I wasn't able to run the test script. It gives the following error:

MemoryError: ('Error allocating 60000000 bytes of device memory (initialization error).', "you might consider using 'theano.shared(..., borrow=True)'")
deviceval = type_support_filter(value, type.broadcastable, False, None)
MemoryError: ('Error allocating 60000000 bytes of device memory (initialization error).', "you might consider using 'theano.shared(..., borrow=True)'")

I noticed that the test script came with device=cpu, and it runs fine on the CPU. Is there a way to make it work on the Titan X?

Hi Kyunghyun,

Thanks a lot for your explanation.

You wrote:


First, we score each word k given a hidden state z_i such that
e(k)=w_k^T z_i + b_k

where w_k and b_k are the (target) word vector and a bias, respectively.

I think there is a mistake here. w_k and z_i are not vectors of the same size, so a dot product cannot be used, right? Or did I miss something?
I think you instead mean multiplying z_i by a weight matrix (of a softmax layer) and then adding the bias. This is typically done before the softmax normalization. Is this what you mean?

Thanks a lot,
Amr Mousa

Hi Amr,

w_k and z_i are indeed vectors of the same size. w_k is simply a row vector of the output weight matrix, whose dimension is set according to the dimension of z_i.

- K
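To make the shapes in the exchange above concrete, here is a minimal NumPy sketch of the scoring step e(k) = w_k^T z_i + b_k. The sizes and the names W_out, b_out, and z_i are made up for illustration and are not taken from the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 4   # dimension of the decoder state z_i (assumed)
vocab_size = 6   # number of target words (assumed)

W_out = rng.standard_normal((vocab_size, hidden_dim))  # row k is w_k
b_out = rng.standard_normal(vocab_size)                # entry k is b_k
z_i = rng.standard_normal(hidden_dim)

# Score a single word k with a dot product between w_k and z_i.
k = 2
e_k = W_out[k] @ z_i + b_out[k]

# Score every word at once: the usual affine layer before the softmax.
e = W_out @ z_i + b_out

# Softmax normalization turns the scores into a distribution over words.
p = np.exp(e - e.max())
p /= p.sum()
```

Both readings in the question describe the same computation: multiplying z_i by the whole weight matrix is just the per-word dot product carried out for all k at once.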

You can make it work on the GPU, but the current script does not support multithreading with the GPU. You can pass -p 1 to the script to use only a single thread, in which case running on the GPU should be fine.

Since each sentence is translated independently of the others, I prefer multithreading over a single thread with the GPU. You can, for instance, use 10 threads with -p 10.

I tried the change you suggested, but I still get the same error. The error seems to occur because it cannot allocate 60 MB of device memory, and likely not because the implementation isn't multi-threaded:

Using gpu device 0: GeForce GTX TITAN X
Translating ../data/test2011/newstest2011.en.tok ...
Error when trying to find the memory information on the GPU: initialization error
Error allocating 60000000 bytes of device memory (initialization error). Driver report 0 bytes free and 0 bytes total
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/", line 258, in _bootstrap
File "/usr/lib/python2.7/multiprocessing/", line 114, in run
self._target(*self._args, **self._kwargs)
File "./", line 25, in translate_model
tparams = init_tparams(params)
File "......./dl4mt-material-master/session1/", line 63, in init_tparams
tparams[kk] = theano.shared(params[kk], name=kk)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/", line 208, in shared
allow_downcast=allow_downcast, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/", line 203, in float32_shared_constructor
deviceval = type_support_filter(value, type.broadcastable, False, None)
MemoryError: ('Error allocating 60000000 bytes of device memory (initialization error).', "you might consider using 'theano.shared(..., borrow=True)'")

Hi Kyunghyun,

Thank you very much for the clear explanation. One point I don't completely understand is what the value of z_0 would be when decoding (assuming that the first recurrent state we need is z_1). I can see that z_1 has only 2 inputs instead of 3 like the other z_i.

Thanks a lot.

Hi Chien,

We often initialize z_0 based on the output of the encoder. See, e.g.,

- K
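One common way to initialize z_0 from the encoder, shown here only as a sketch, is to pass a summary of the encoder states through a small feed-forward layer. The mean pooling, the tanh layer, and the names W_init and b_init are assumptions for illustration; using only the last encoder state h_T is an equally reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

src_len, enc_dim, dec_dim = 5, 4, 3   # illustrative sizes

# Encoder states h_1..h_T, one per row.
H = rng.standard_normal((src_len, enc_dim))

# Hypothetical parameters of the initialization layer.
W_init = rng.standard_normal((dec_dim, enc_dim)) * 0.1
b_init = np.zeros(dec_dim)

# Summarize the source sentence; H[-1] would use only the last state h_T.
summary = H.mean(axis=0)

# Initial decoder state z_0, so that z_1 also has a "previous state" input.
z_0 = np.tanh(W_init @ summary + b_init)
```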

Suppose my utterance is "How are you" and the response is "Iam Fine".

The final hidden state of the encoder has the information of the whole sentence "How are you", and we send that to the decoder.

In the paragraph,

"Now we have a probability distribution over the target words, which we can use to select a word by sampling the distribution (see here), as Figure 9 shows. After choosing the i-th word, we go back to the first step of computing the decoder’s internal hidden state (Figure 7), scoring and normalizing the target words (Figure 8) and selecting the next (i+1)-th word (Figure 9), repeating until we select the end-of-sentence word (<eos>)."

You say "After choosing the i-th word".

Do you mean that you choose the word with the highest probability, convert it into a one-hot vector, and give it to the decoder as input? Or do you mean that you choose the word "Iam" (as in this example) along with its probability, convert it to a one-hot vector, and input that to the decoder?

Hi Harsha,

It'll depend on whether you're training or testing. If you're training to maximize the log-likelihood, you simply use the ground-truth word (e.g., "Iam" in your example.) If you're testing, you will either choose the most likely word so far (greedy decoding), sample from the conditional distribution (ancestral sampling) or use another more sophisticated decoding algorithm (e.g., beam search.)

For more about this, please refer to my lecture note at

- K
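The three options in the reply above (ground-truth word during training, greedy decoding, ancestral sampling) can be sketched as follows. The toy vocabulary, the probabilities, and the helper one_hot are invented for illustration; beam search is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<eos>", "Iam", "Fine", "you"]   # toy vocabulary (assumed)
p = np.array([0.1, 0.6, 0.2, 0.1])        # p(word | previous words, source)

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Training (maximum likelihood): feed the ground-truth word,
# regardless of what the model currently predicts.
truth_idx = vocab.index("Iam")
train_input = one_hot(truth_idx, len(vocab))

# Testing, greedy decoding: feed the most likely word.
greedy_idx = int(np.argmax(p))
greedy_input = one_hot(greedy_idx, len(vocab))

# Testing, ancestral sampling: draw a word index according to p.
sampled_idx = int(rng.choice(len(vocab), p=p))
sampled_input = one_hot(sampled_idx, len(vocab))
```

In every case the decoder receives a one-hot vector (or, equivalently, the corresponding word embedding); the probability itself is not fed back.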

Hello Sir,

your answer greatly helped.

Thank you

Could you kindly explain what the E matrix is and how to get it? One more point regarding the z equation: you wrote that its non-linear function was described in the previous post. Indeed, there were two equations there, one for the RNN hidden state and one for the GRU hidden state. But I didn't find a description of how to obtain the values of z, so I didn't understand how to compute it. Did I miss something? Please clarify how to calculate z.

Thanks in advance.

E is a word embedding matrix, of which each row corresponds to a word vector. z is the hidden activation vector of a recurrent net and computed based on your choice of activation function (e.g., GRU).
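As a small illustration of what "each row corresponds to a word vector" means, the sketch below shows that looking up row k of E gives the same result as multiplying a one-hot vector by E. The sizes are arbitrary, and E is filled with random numbers here; in most implementations it is simply another parameter, initialized randomly and learned together with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 5, 3   # illustrative sizes

# Embedding matrix: row k is the vector of word k.
E = rng.standard_normal((vocab_size, emb_dim))

k = 3
one_hot = np.zeros(vocab_size)
one_hot[k] = 1.0

u_lookup = E[k]          # direct row lookup
u_matmul = one_hot @ E   # the same vector via the one-hot product
```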

Also, could you kindly explain why it is necessary to sample from a multinomial distribution? Why not just select the element with the maximum probability? I also tried np.random.multinomial(...) for this purpose, and it causes the whole model to fail the gradient check.

Yes, this is clear to me. But I was asking how h_t relates to z_i. All illustrations show that h is passed to z at each decoder iteration. But every recurrent cell receives only two inputs: the input vector (u in our case) and its previous hidden vector (z_{i-1} in our case). I still can't see how h is taken into account at each decoder step.

You can think of it as concatenating u and h_T at every time step in the decoder.
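A minimal sketch of that idea follows. It uses a plain tanh recurrent step rather than a GRU to keep the code short, and all sizes and parameter names (W, U, b) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, enc_dim, dec_dim = 3, 4, 5    # illustrative sizes

u_prev = rng.standard_normal(emb_dim)  # embedding of the previously chosen word
h_T = rng.standard_normal(enc_dim)     # encoder summary, the same at every step
z_prev = np.zeros(dec_dim)             # previous decoder state z_{i-1}

W = rng.standard_normal((dec_dim, emb_dim + enc_dim)) * 0.1  # input weights
U = rng.standard_normal((dec_dim, dec_dim)) * 0.1            # recurrent weights
b = np.zeros(dec_dim)

def decoder_step(u_prev, h_T, z_prev):
    """One plain tanh RNN step; a GRU would take the same two inputs."""
    # The cell still receives two things: the input x and the state z_prev.
    # The encoder summary enters through x.
    x = np.concatenate([u_prev, h_T])
    return np.tanh(W @ x + U @ z_prev + b)

z_i = decoder_step(u_prev, h_T, z_prev)
```

Seen this way, the recurrent cell keeps its usual two-argument form: the "input vector" is simply [u; h_T] instead of u alone.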

Sampling or greedy-max is only for decoding. For MLE training, you do not need to do either.
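To illustrate why neither is needed, here is a sketch of the training loss for a two-word target. The probabilities and target indices are toy values. The loss only reads off the probability the model assigns to each ground-truth word, so it contains no random draw; a random draw inside the model is a plausible reason for a gradient check to fail, although that is a guess about the setup described above.

```python
import numpy as np

# Predicted distributions for a 2-word target over a toy 4-word vocabulary.
probs = np.array([
    [0.1, 0.6, 0.2, 0.1],   # step 1
    [0.2, 0.1, 0.6, 0.1],   # step 2
])
targets = np.array([1, 2])  # ground-truth word indices

# Negative log-likelihood: pick out the probability of each correct word.
# There is no argmax and no sampling, so the loss is a deterministic
# function of whatever parameters produced `probs`.
nll = -np.log(probs[np.arange(len(targets)), targets]).sum()
```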

So the concatenated pair (u, h_T) is in turn concatenated with z_{i-1} and passed to the decoder. Am I right?

Does the decoder have to be trained separately from the encoder?