Keras on Jetson TK1

Hi all, I followed the instructions to install TensorFlow (0.8) on my TK1 and have run the mnist example on it. Now I want to use Keras with it. I see that Keras 1.0.4 is required for TF 0.8, so I installed that version using pip. However, Keras accuracy is VERY low (10-11%) compared to my x86 system, and it does not change whether I use uint8, float16, or float32. It's also low compared to the TF run. Does anyone have any guidance on using Keras? A second question: how can I tell whether Keras and/or TF is using the GPU? Thanks.

Hi,

Please use tegrastats to check GPU utilization.

sudo ./tegrastats

Please also run inference directly in TensorFlow to check whether its accuracy degrades as well.

Thanks.
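
For reference, tegrastats prints one status line per interval, and the GPU load appears in the GR3D field. The sketch below pulls that number out of a line. The sample line and its field layout are assumptions (the format differs between L4T releases), and gpu_utilization is a hypothetical helper, not part of any NVIDIA tool.

```python
import re

# Assumed example of a tegrastats line; the real layout varies by L4T release.
SAMPLE_LINE = ("RAM 1021/1892MB (lfb 12x4MB) cpu [35%,12%,8%,4%]@2065 "
               "EMC 18%@924 AVP 0%@204 VDE 0 GR3D 87%@852 EDP limit 0")


def gpu_utilization(line):
    """Return the GR3D (GPU) load in percent, or None if the field is absent."""
    # Some releases label the field GR3D_FREQ instead of GR3D.
    match = re.search(r"GR3D(?:_FREQ)?\s+(\d+)%", line)
    return int(match.group(1)) if match else None


print(gpu_utilization(SAMPLE_LINE))
```

A value that stays at 0% while training is running would suggest the work is happening on the CPU.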

Hi, after upgrading to JetPack 3.0 I can now see with tegrastats that both Keras and TF are using the GPU. However, whereas the TF mnist example gives 92% accuracy, the Keras 1.0.4 mnist_cnn.py example gives 11% accuracy. Any further ideas would be helpful. Since there are 10 digits in mnist, perhaps the classifier is not training at all and is just guessing at random (1/10 == 10%)? Thanks.
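
The 1/10 reasoning can be checked with a quick simulation. This is only an illustration of chance-level accuracy for 10 equally likely classes; it uses made-up random labels, not MNIST, and chance_accuracy is a hypothetical helper.

```python
import random


def chance_accuracy(num_samples=10000, num_classes=10, seed=0):
    """Accuracy of a classifier that guesses uniformly at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        label = rng.randrange(num_classes)  # stand-in for a true digit label
        guess = rng.randrange(num_classes)  # untrained model: random guess
        hits += (label == guess)
    return hits / float(num_samples)


print("chance accuracy: %.3f" % chance_accuracy())
```

An accuracy that sits near this level for a whole run is what one would expect if the weights never receive a useful update.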

Hi,

We will look into this issue and post an update later.
Thanks.

Thank you. I did some additional testing and found that when I run mnist_cnn.py from the Keras 1.0.4 examples using the CPU ONLY, I get the expected accuracy. It obviously takes a long time to run, and I confirmed via tegrastats that the GPU is not being used. See the first epoch result below:

Using TensorFlow backend.
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:102] kernel driver does not appear to be running on this host (tegra-ubuntu): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
59904/60000 [============================>.] - ETA: 4s - loss: 0.3842 - acc: 0.8836


So to summarize:
JetPack 3.0
TF 0.8
Keras 1.0.4
CUDA mnist example gives expected accuracy and confirmed running on GPU
TF mnist example gives expected accuracy and confirmed running on GPU
Keras mnist_cnn.py example gives expected accuracy using TF as backend ONLY WITH CPU ONLY ENABLED
==>> Keras mnist_cnn.py example DOES NOT GIVE expected accuracy using TF as backend and TF running CUDA (confirmed w/ tegrastats)
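
For anyone reproducing the CPU-only comparison: one common way to hide the GPU from TensorFlow is to set CUDA_VISIBLE_DEVICES to an empty string before TensorFlow or Keras is imported. This is a sketch under that assumption, not necessarily how the run above was configured, and hide_gpus is a hypothetical helper name.

```python
import os


def hide_gpus():
    """Make CUDA report no devices to libraries loaded after this call."""
    # The variable is read when the CUDA driver is initialized, so it has
    # to be set before TensorFlow (or Keras with the TF backend) is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    return os.environ["CUDA_VISIBLE_DEVICES"]


hide_gpus()
# Import keras / tensorflow only after this point.
print("CUDA_VISIBLE_DEVICES=%r" % os.environ["CUDA_VISIBLE_DEVICES"])
```

With no device visible, a cuInit failure with CUDA_ERROR_NO_DEVICE like the one in the log above is the expected message, and tegrastats should show the GPU idle.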

Hi, are there any updates to this issue?

Hi,

Thanks for your patience.

Did you install Keras from source or via apt-get?
It looks like Keras doesn't recognize the Tegra GPU.

Neither. I installed Keras 1.0.4 with pip install. Keras has dependencies on other packages, so I first ran sudo apt-get install libblas-dev liblapack-dev libhdf5-dev gfortran and then used pip to install:
scikit-learn
scipy
numpy
sklearn
h5py
Pillow
Theano
TensorFlow

Keras uses TF 0.8 as its backend, which I installed from a native build with CUDA enabled for the TK1 (this process has been documented elsewhere).

I would like to test this using a Theano backend to see if it has the same issue, but as other people have noted, a native Theano build would require pygpu, and that requires packages not available on the TK1, e.g. cmake. See: https://devtalk.nvidia.com/default/topic/1008898/how-to-configure-gpu-for-theano-on-jetson-tk1

Hi,

I guess some extra options need to be added to make Keras recognize the Tegra GPU.
Do you know where we can find the Keras source to check this?

Thanks.

Hi, here is the link to the Keras project on github.

Make sure you are looking at the Keras 1 branch.

Keras is a python wrapper over TF. Therefore if TF recognizes the Tegra GPU, Keras will run on the GPU.

I think the best way forward is for someone from NVIDIA to get Theano working on the Tegra GPU, building from scratch. That would help isolate whether this is an interaction between Keras and TF or whether it happens with Keras on either Theano or TF.
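
Since Keras simply hands the work to whichever backend it is configured for, it can help to confirm that setting first. In Keras 1.x the backend is normally read from ~/.keras/keras.json and can be overridden with the KERAS_BACKEND environment variable. The sketch below mirrors that lookup order as an assumption about the installed version; pick_backend is a hypothetical helper, not a Keras function.

```python
import json
import os


def pick_backend(config_text, environ):
    """Return the backend name Keras would be asked to use."""
    # The environment variable, when set, takes precedence over the file.
    override = environ.get("KERAS_BACKEND")
    if override:
        return override
    return json.loads(config_text).get("backend", "unknown")


config_path = os.path.expanduser(os.path.join("~", ".keras", "keras.json"))
if os.path.exists(config_path):
    with open(config_path) as handle:
        print("configured backend: %s" % pick_backend(handle.read(), os.environ))
else:
    print("no keras.json found at %s" % config_path)
```

The "Using TensorFlow backend." line that Keras prints at import time gives the same answer at runtime.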

Hi,

Sorry for the late reply.

We suspect Keras more, since the pure TensorFlow test is good.
The GPU may skip the kernel if the configuration is wrong.
As a result, the output won't be updated and the accuracy will drop to 0.1 (random guess).
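
One way to test the "output is not updated" idea is to snapshot a layer's weights before and after a single training batch and see whether anything moved. The sketch below works on plain lists of floats; the snapshots shown are made-up values, and in practice they might come from flattening what model.get_weights() returns in Keras. max_abs_change is a hypothetical helper.

```python
def max_abs_change(before, after):
    """Largest absolute difference between two flat lists of weights."""
    if len(before) != len(after):
        raise ValueError("snapshots must have the same length")
    if not before:
        return 0.0
    return max(abs(a - b) for a, b in zip(after, before))


# Hypothetical snapshots taken before and after one training batch.
snapshot_before = [0.10, -0.20, 0.05]
snapshot_after = [0.10, -0.20, 0.05]

if max_abs_change(snapshot_before, snapshot_after) == 0.0:
    print("weights did not move: the update step may not be running")
else:
    print("weights changed: the update step is doing something")
```

If the weights do change on the GPU but accuracy stays at chance, the problem is more likely in the values the kernels compute than in kernels being skipped.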

Unfortunately, we are currently stuck compiling TensorFlow because of a segmentation fault in the Eigen library.
We will keep debugging this and will also look for a pre-built wheel for tx1.

In the meantime, could you try the following code and check whether the output is correct?

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def weight_varible(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print("Download Done!")

sess = tf.InteractiveSession()

# paras
W_conv1 = weight_varible([5, 5, 1, 32])
b_conv1 = bias_variable([32])

# conv layer-1
x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# conv layer-2
W_conv2 = weight_varible([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# full connection
W_fc1 = weight_varible([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# dropout
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# output layer: softmax
W_fc2 = weight_varible([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
y_ = tf.placeholder(tf.float32, [None, 10])

# model training
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.arg_max(y_conv, 1), tf.arg_max(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.initialize_all_variables())

for i in range(20000):
    batch = mnist.train.next_batch(50)

    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
    train_step.run(feed_dict = {x: batch[0], y_: batch[1], keep_prob: 0.5})

# accuracy on test
print("test accuracy %g"%(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})))

According to https://github.com/fchollet/keras/issues/3508, another possible cause lies in TensorFlow itself, but it only shows up for certain kinds of tasks.

Hi, as I previously posted, the mnist example from TensorFlow runs correctly, and I have confirmed that it uses the GPU. Thanks.

Also, there are a few existing whl’s out there for TF 0.8 for the TK1. There are also multiple posts on how to build the whl.

Hi,

If possible, please give the code in comment #11 a try.

Although it is also an MNIST training task, it is written the way Keras implements it.
According to the link, a user got good accuracy with the default TensorFlow MNIST sample but low accuracy with this script.

It will help us narrow down the problem.

Thanks.

Here is the output from that run.
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:875] ARMV7 does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GK20A
major: 3 minor: 2 memoryClockRate (GHz) 0.852
pciBusID 0000:00:00.0
Total memory: 1.85GiB
Free memory: 929.62MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Download Done!
step 0, training accuracy 0.04
step 100, training accuracy 0.06
step 200, training accuracy 0.04
step 300, training accuracy 0.16
step 400, training accuracy 0.08
step 500, training accuracy 0.1
step 600, training accuracy 0.08
step 700, training accuracy 0.1
step 800, training accuracy 0.14
step 900, training accuracy 0.14
step 1000, training accuracy 0.04
step 1100, training accuracy 0.16
step 1200, training accuracy 0.12
step 1300, training accuracy 0.12
step 1400, training accuracy 0.12
step 1500, training accuracy 0.12
step 1600, training accuracy 0.08
step 1700, training accuracy 0.1
step 1800, training accuracy 0.12
step 1900, training accuracy 0.14
step 2000, training accuracy 0.04
step 2100, training accuracy 0.2
step 2200, training accuracy 0.06
step 2300, training accuracy 0.08
step 2400, training accuracy 0.12
step 2500, training accuracy 0.06
step 2600, training accuracy 0.08
step 2700, training accuracy 0.16
step 2800, training accuracy 0.1
step 2900, training accuracy 0.06
step 3000, training accuracy 0.14
step 3100, training accuracy 0.14
step 3200, training accuracy 0.18
step 3300, training accuracy 0.12
step 3400, training accuracy 0.1
step 3500, training accuracy 0.12
step 3600, training accuracy 0.16
step 3700, training accuracy 0.16
step 3800, training accuracy 0.06
step 3900, training accuracy 0.14
step 4000, training accuracy 0.08
step 4100, training accuracy 0.14
step 4200, training accuracy 0.08
step 4300, training accuracy 0.06
Traceback (most recent call last):
  File "./tf_mnist_test_from_nv.py", line 70, in <module>
    batch = mnist.train.next_batch(50)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 154, in next_batch
    self._images = self._images[perm]
MemoryError
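
To put numbers on the run above: the per-step figures are measured on batches of 50, so they are noisy, but their average can be compared with what pure guessing would give. The sketch below uses the 44 accuracies copied from the log and the standard binomial spread for a batch of 50 with a 1-in-10 chance per sample.

```python
import math

# Training accuracies copied from the log above (steps 0 through 4300).
reported = [
    0.04, 0.06, 0.04, 0.16, 0.08, 0.10, 0.08, 0.10, 0.14, 0.14,
    0.04, 0.16, 0.12, 0.12, 0.12, 0.12, 0.08, 0.10, 0.12, 0.14,
    0.04, 0.20, 0.06, 0.08, 0.12, 0.06, 0.08, 0.16, 0.10, 0.06,
    0.14, 0.14, 0.18, 0.12, 0.10, 0.12, 0.16, 0.16, 0.06, 0.14,
    0.08, 0.14, 0.08, 0.06,
]

mean_accuracy = sum(reported) / len(reported)

# Spread of single-batch accuracy if every prediction were a uniform
# random guess among 10 classes (binomial standard deviation).
batch_size, chance = 50, 0.1
chance_std = math.sqrt(chance * (1 - chance) / batch_size)

print("mean over %d reports: %.3f" % (len(reported), mean_accuracy))
print("chance level: %.1f +/- %.3f per batch" % (chance, chance_std))
```

The mean comes out at about 0.107 with no upward trend, and every reported value lies within roughly two and a half standard deviations of 0.1, which is consistent with a network that is not learning at all.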

Hi,

It looks like the root cause is in TensorFlow.
Your free memory is quite low. Could you add some swap space?

Total memory: 1.85GiB
Free memory: 929.62MiB
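
After adding swap, it is worth confirming that the kernel actually sees it. On Linux, /proc/meminfo reports SwapTotal and SwapFree in kB. The sketch below parses that format; the sample text contains illustrative values only, and parse_meminfo is a hypothetical helper.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (kB for sizes)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Illustrative sample only. On the board, read the real file instead:
#   with open("/proc/meminfo") as handle:
#       text = handle.read()
sample = (
    "MemTotal:        1938020 kB\n"
    "SwapTotal:       8388604 kB\n"
    "SwapFree:        8388604 kB\n"
)

info = parse_meminfo(sample)
swap_gib = info.get("SwapTotal", 0) / (1024.0 * 1024.0)
print("swap total: %.1f GiB" % swap_gib)
```

A SwapTotal of 0 would mean the swap file or partition was created but never enabled.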

I thought I had created an 8GB swap, but got this:
step 19900, training accuracy 0.12
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (256): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (512): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (1024): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (2048): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (4096): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (8192): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (16384): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (32768): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (65536): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (131072): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (262144): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (524288): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (1048576): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (2097152): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (4194304): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (8388608): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (16777216): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (33554432): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (67108864): Total C.
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (134217728): .
I tensorflow/core/common_runtime/bfc_allocator.cc:635] Bin (268435456): .
I tensorflow/core/common_runtime/bfc_allocator.cc:652] Bin for 957.03MiB was 25
I tensorflow/core/common_runtime/bfc_allocator.cc:658] Size: 911.35MiB | Requ1
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb3f000 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb3f100 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb3f200 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb3f300 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb3f400 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb40400 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb40500 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb40600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb40700 of s8
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb41400 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb41500 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb73500 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4eb73600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7b3600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7b4600 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7be600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7be700 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7be800 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7be900 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7bf900 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7bfa00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7bfb00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7bfc00 of s8
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7c0900 of s8
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7c1600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7c1700 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7c1800 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f7f3800 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f825800 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f825900 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x4f825a00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x50465a00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510a5a00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510a6a00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510a7a00 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510b1a00 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bba00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bbb00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bbc00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bbd00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bbe00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bbf00 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc000 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc100 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc200 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc300 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc400 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510bc900 of s8
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510c7600 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x510ef600 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x51121600 of s8
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x51d61600 of s6
I tensorflow/core/common_runtime/bfc_allocator.cc:670] Chunk at 0x529a1600 of s0
I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x510bc500 of si4
I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x510bd600 of si0
I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x510d1600 of si0
I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x51183100 of si8
I tensorflow/core/common_runtime/bfc_allocator.cc:679] Free at 0x54789a00 of si6
I tensorflow/core/common_runtime/bfc_allocator.cc:685] Summary of in-use C
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 30 Chunks of size 256 toB
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 4 Chunks of size 3328 toB
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 5 Chunks of size 4096 toB
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 4 Chunks of size 40960 tB
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 4 Chunks of size 204800 B
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 1 Chunks of size 400128 B
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 4 Chunks of size 1284505B
I tensorflow/core/common_runtime/bfc_allocator.cc:688] 1 Chunks of size 3136000B
I tensorflow/core/common_runtime/bfc_allocator.cc:692] Sum Total of in-use chunB
I tensorflow/core/common_runtime/bfc_allocator.cc:694] Stats:
Limit: 1052393472
InUse: 84164864
MaxInUse: 84222976
NumAllocs: 2167865
MaxAllocSize: 31360000

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****Xiϱ
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying.
W tensorflow/core/framework/op_kernel.cc:900] Resource exhausted: OOM when allo]
Traceback (most recent call last):
File "./tf_mnist_test_from_nv.py", line 78, in <module>
print("test accuracy %g"%(accuracy.eval(feed_dict={x: mnist.test.images, y_)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.l
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.n
return session.run(tensors, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/sessionn
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/sessionn
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/sessionn
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/sessionl
e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating ]
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME"]
[[Node: Mean/_9 = _Recv[client_terminated=false, recv_device="/job:loc]
Caused by op u'Conv2D', defined at:
File "./tf_mnist_test_from_nv.py", line 32, in <module>
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "./tf_mnist_test_from_nv.py", line 13, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_opsd
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_libp
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.p
original_op=self.default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.

self._traceback = _extract_stack()
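
A note on the numbers in this log, offered as a sketch rather than a confirmed diagnosis. Training reached step 19900, and the traceback points at line 78, the final test-accuracy evaluation, which feeds all 10,000 test images in one call. The first convolution's output for that batch (10000 x 28 x 28 x 32 float32 values) comes to 957.03 MiB, the same size as the failing request, and together with the 84164864 bytes already in use it exceeds the allocator limit of 1052393472 bytes. Evaluating the test set in slices keeps each allocation small. conv_output_bytes and batched_accuracy are hypothetical helpers; eval_batch stands for a callback that, in the script, would wrap accuracy.eval(feed_dict={x: xs, y_: ys, keep_prob: 1.0}).

```python
def conv_output_bytes(num_images, height, width, channels, bytes_per_value=4):
    """Size of one float32 activation tensor in NHWC layout."""
    return num_images * height * width * channels * bytes_per_value


full_batch = conv_output_bytes(10000, 28, 28, 32)
print("%.2f MiB" % (full_batch / float(2 ** 20)))  # prints "957.03 MiB"

# Values taken from the allocator stats in the log.
limit_bytes, in_use_bytes = 1052393472, 84164864
print("fits in limit: %s" % (full_batch + in_use_bytes <= limit_bytes))


def batched_accuracy(eval_batch, images, labels, batch_size=500):
    """Sample-weighted mean accuracy over fixed-size slices of the test set."""
    total_correct = 0.0
    for start in range(0, len(images), batch_size):
        xs = images[start:start + batch_size]
        ys = labels[start:start + batch_size]
        # eval_batch returns the fraction correct for this slice.
        total_correct += eval_batch(xs, ys) * len(xs)
    return total_correct / len(images)
```

With slices of 500 images the same activation is about 48 MiB, which leaves plenty of headroom under the limit shown in the stats.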

Can you run this for something like 10k or 5k steps and compare the results?

Hi,

Sorry for the late reply.
Just to confirm: did you build TensorFlow for compute capability 3.2?

Yes, I followed the instructions found at http://cudamusing.blogspot.com/2015/11/building-tensorflow-for-jetson-tk1.html and, as previously reported, mnist from the TF examples works fine.

At this point, I would suggest that someone from NVIDIA actually build the TF wheel via these instructions, as well as build Theano (since it would be helpful to test BOTH backends as a comparison), and debug the mnist example that was posted here, which seemingly mimics the Keras commands. Question: does this example work on a.) ARM only and b.) other CUDA GPU systems? Again, I suggest only running through 10k iterations to make it easier to run on the TK1 with swap.

Any updates on this issue or getting Theano running on TK1?