Keras on Jetson TK1

Hi All, I have followed instructions to install TensorFlow (0.8) on my TK1 and have run mnist on that. Now I want to use Keras with that. I see that Keras 1.0.4 is required for TF 0.8 and have installed it using pip. However, Keras accuracy is VERY low (10-11%) compared to my x86 system, and the accuracy does not change whether I use uint8, float16, or float32. It's also low compared to the TF run. Does anyone have any guidance on using Keras? A second question – how can I tell if Keras and/or TF is using the GPU? Thanks.


Please use tegrastats to check GPU utilization.

sudo ./tegrastats

Please also run inference directly in TensorFlow to check whether its accuracy degrades as well.


Hi, with an upgrade to JetPack 3.0 I can now see with tegrastats that both Keras and TF are using the GPU. However, whereas the TF mnist example gives 92% accuracy, the Keras 1.0.4 example gives 11% accuracy. Any further ideas would be helpful. Since there are 10 digits in mnist, perhaps the classifier is not training and is just guessing at random – 1/10 == 10%? Thanks.
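The 1/10 intuition can be checked with a quick simulation. This is a minimal sketch in plain Python, not part of the Keras or TF examples; the function name `chance_accuracy` and the sample count are illustrative assumptions. It guesses a digit uniformly at random for each label and measures how often the guess is right.

```python
import random

def chance_accuracy(num_classes=10, num_samples=10000, seed=0):
    """Accuracy of a classifier that guesses uniformly at random."""
    rng = random.Random(seed)
    labels = [rng.randrange(num_classes) for _ in range(num_samples)]
    guesses = [rng.randrange(num_classes) for _ in range(num_samples)]
    correct = sum(1 for y, g in zip(labels, guesses) if y == g)
    return correct / float(num_samples)

print("chance accuracy: %.3f" % chance_accuracy())
```

An accuracy that stays close to this value across epochs is what one would expect if the weights are not being updated at all, rather than from a numeric precision problem.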


We will look into this issue and update you later.

Thank you. I did some additional testing and find that when I run from the Keras 1.0.4 examples using the CPU ONLY I get the expected accuracy. It obviously takes a long time to run and I confirm that the GPU is not being used via tegrastats … see the first epoch result below:

Using TensorFlow backend.
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
E tensorflow/stream_executor/cuda/] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/] kernel driver does not appear to be running on this host (tegra-ubuntu): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/] No GPU devices available on machine.
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
59904/60000 [============================>.] - ETA: 4s - loss: 0.3842 - acc: 0.8836

So to summarize:
JetPack 3.0
TF 0.8
Keras 1.0.4
CUDA mnist example gives expected accuracy and confirmed running on GPU
TF mnist example gives expected accuracy and confirmed running on GPU
Keras example gives expected accuracy using TF as backend ONLY WITH CPU ONLY ENABLED
==>> Keras example DOES NOT GIVE expected accuracy using TF as backend and TF running CUDA (confirmed w/ tegrastats)
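For anyone reproducing the CPU-only comparison: one common way to hide the GPU from TensorFlow is to set `CUDA_VISIBLE_DEVICES` to an empty string before TensorFlow is imported. The thread does not say how CPU-only mode was enabled here, so treat this as an assumption; the helper name `force_cpu_only` is hypothetical.

```python
import os

def force_cpu_only():
    """Hide all CUDA devices from libraries imported after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

force_cpu_only()
# Import the backend only after the variable is set, for example:
# import keras  # then run the mnist example as usual
print("CUDA_VISIBLE_DEVICES=%r" % os.environ["CUDA_VISIBLE_DEVICES"])
```

With no visible device, TF typically logs a CUDA_ERROR_NO_DEVICE message like the one in the log above and falls back to the CPU.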

Hi, are there any updates to this issue?


Thanks for your patience.

Did you install Keras from source or via apt-get?
It looks like Keras doesn’t recognize the Tegra GPU.

No, I installed using pip install with keras 1.0.4. Keras has dependencies on other packages so I did a sudo apt-get install libblas-dev liblapack-dev libhdf5-dev gfortran and then used pip to install:

keras uses TF 0.8 as a backend which I did install using a native build to enable for CUDA on the TK1 (this process has been documented elsewhere).
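To double-check which backend Keras has been asked to use, the configuration can be read directly. This is a minimal sketch, assuming the usual Keras conventions of a `KERAS_BACKEND` environment variable and a `~/.keras/keras.json` file with a `backend` key; the function name `configured_backend` is hypothetical, and the "Using TensorFlow backend." line Keras prints at import time remains the authoritative indication.

```python
import json
import os

def configured_backend(config_path=None, environ=None):
    """Return the backend name Keras is configured for, or None if unset."""
    environ = os.environ if environ is None else environ
    # The environment variable, when present, takes precedence over the file.
    if "KERAS_BACKEND" in environ:
        return environ["KERAS_BACKEND"]
    if config_path is None:
        config_path = os.path.expanduser(
            os.path.join("~", ".keras", "keras.json"))
    if not os.path.exists(config_path):
        return None
    with open(config_path) as f:
        return json.load(f).get("backend")

print("configured backend: %s" % configured_backend())
```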

I would like to test this using a Theano backend to see if it has the same issue but as other people have noted Theano native build would require pygpu and that requires pkgs not available on TK1, e.g. cmake … see:


I guess some extra options need to be added to make Keras recognize the Tegra GPU.
Do you know where we can find the Keras source to check this?


Hi, here is the link to the Keras project on github.

Make sure you are looking at the Keras 1 branch.

Keras is a python wrapper over TF. Therefore if TF recognizes the Tegra GPU, Keras will run on the GPU.

I think the best way forward is to have someone from NVIDIA get Theano working on the Tegra GPU, building from scratch. That would help isolate whether this is an interaction between Keras and TF or whether it happens for Keras on either Theano or TF.


Sorry for the late reply.

We suspect Keras more, since the pure TensorFlow test is good.
The GPU may bypass the kernel if the configuration is wrong.
As a result, the output won’t be updated and the accuracy will drop to 0.1 (random guess).

But unfortunately, we are now stuck compiling TensorFlow because of a segmentation fault in the Eigen library.
We will keep debugging this and also search for a pre-built wheel for tx1.

In the meantime, could you help us by trying the following code and checking whether the output is correct?

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def weight_varible(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print("Download Done!")

sess = tf.InteractiveSession()

# paras
W_conv1 = weight_varible([5, 5, 1, 32])
b_conv1 = bias_variable([32])

# conv layer-1
x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# conv layer-2
W_conv2 = weight_varible([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# full connection
W_fc1 = weight_varible([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# dropout
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# output layer: softmax
W_fc2 = weight_varible([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
y_ = tf.placeholder(tf.float32, [None, 10])

# model training
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.arg_max(y_conv, 1), tf.arg_max(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

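# initialize all variables before training
sess.run(tf.initialize_all_variables())

for i in range(20000):
    batch = mnist.train.next_batch(50)

    if i % 100 == 0:
        # report accuracy on the current batch with dropout disabled
        train_accuacy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuacy))
    # one optimizer step with 50% dropout
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

# accuracy on test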
print("test accuracy %g"%(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})))

According to the GitHub issue "Very low accuracy in the mnist_cnn when running on a GPU using tensorflow backend" (Issue #3508, keras-team/keras), another possible cause lies in TensorFlow itself but only shows up for certain kinds of tasks.

Hi, as I previously posted the mnist example from Tensor Flow runs correctly, and I have determined it uses the GPU. Thanks.

Also, there are a few existing whl’s out there for TF 0.8 for the TK1. There are also multiple posts on how to build the whl.


If possible, please give comment #11 a try.

Although it is also an MNIST training task, it is written the way Keras implements it.
In the linked issue, the user gets good accuracy with the default TensorFlow MNIST sample but low accuracy with this script.

It will help us narrow down the problem.


Here is the output from that run.
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/cuda/] ARMV7 does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: GK20A
major: 3 minor: 2 memoryClockRate (GHz) 0.852
pciBusID 0000:00:00.0
Total memory: 1.85GiB
Free memory: 929.62MiB
I tensorflow/core/common_runtime/gpu/] DMA: 0
I tensorflow/core/common_runtime/gpu/] 0: Y
I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) → (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Download Done!
step 0, training accuracy 0.04
step 100, training accuracy 0.06
step 200, training accuracy 0.04
step 300, training accuracy 0.16
step 400, training accuracy 0.08
step 500, training accuracy 0.1
step 600, training accuracy 0.08
step 700, training accuracy 0.1
step 800, training accuracy 0.14
step 900, training accuracy 0.14
step 1000, training accuracy 0.04
step 1100, training accuracy 0.16
step 1200, training accuracy 0.12
step 1300, training accuracy 0.12
step 1400, training accuracy 0.12
step 1500, training accuracy 0.12
step 1600, training accuracy 0.08
step 1700, training accuracy 0.1
step 1800, training accuracy 0.12
step 1900, training accuracy 0.14
step 2000, training accuracy 0.04
step 2100, training accuracy 0.2
step 2200, training accuracy 0.06
step 2300, training accuracy 0.08
step 2400, training accuracy 0.12
step 2500, training accuracy 0.06
step 2600, training accuracy 0.08
step 2700, training accuracy 0.16
step 2800, training accuracy 0.1
step 2900, training accuracy 0.06
step 3000, training accuracy 0.14
step 3100, training accuracy 0.14
step 3200, training accuracy 0.18
step 3300, training accuracy 0.12
step 3400, training accuracy 0.1
step 3500, training accuracy 0.12
step 3600, training accuracy 0.16
step 3700, training accuracy 0.16
step 3800, training accuracy 0.06
step 3900, training accuracy 0.14
step 4000, training accuracy 0.08
step 4100, training accuracy 0.14
step 4200, training accuracy 0.08
step 4300, training accuracy 0.06
Traceback (most recent call last):
File “./”, line 70, in
batch = mnist.train.next_batch(50)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/”, line 154, in next_batch
self._images = self._images[perm]
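To judge whether the numbers above are distinguishable from guessing, the reported per-step accuracies can be compared with what a random 10-class guesser would produce on batches of 50. A minimal sketch, assuming roughly balanced classes and independent samples; the helper names are illustrative, and the sample lines are copied from the log above.

```python
import math
import re

def parse_accuracies(log_text):
    """Extract the accuracy from each 'step N, training accuracy X' line."""
    pattern = re.compile(r"step (\d+), training accuracy ([0-9.]+)")
    return [float(m.group(2)) for m in pattern.finditer(log_text)]

def chance_band(num_classes=10, batch_size=50, sigmas=3):
    """Mean and +/- sigmas band for per-batch accuracy under random guessing."""
    p = 1.0 / num_classes
    std = math.sqrt(p * (1.0 - p) / batch_size)  # binomial proportion std
    return p, max(0.0, p - sigmas * std), p + sigmas * std

sample_log = """
step 0, training accuracy 0.04
step 100, training accuracy 0.06
step 200, training accuracy 0.04
step 300, training accuracy 0.16
step 2100, training accuracy 0.2
"""

accs = parse_accuracies(sample_log)
mean, low, high = chance_band()
inside = [low <= a <= high for a in accs]
print("chance band: %.3f to %.3f" % (low, high))
print("all inside band: %s" % all(inside))
```

Values that never leave this band over thousands of steps are consistent with a network whose weights are not changing.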


It looks like the root cause is TensorFlow.
Your memory is quite low. Could you add some swap space?

Total memory: 1.85GiB
Free memory: 929.62MiB
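Before rerunning, it may help to confirm that the swap is actually active. The sketch below parses `/proc/meminfo`-style text for the `SwapTotal` and `SwapFree` fields; the function names are hypothetical and the sample values are illustrative only, not taken from this board.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def swap_summary(meminfo_path="/proc/meminfo"):
    """Return (SwapTotal, SwapFree) in kB read from the given file."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    return info.get("SwapTotal", 0), info.get("SwapFree", 0)

# Illustrative input only; on the board, call swap_summary() instead.
sample = "MemTotal:  1939456 kB\nSwapTotal: 8388604 kB\nSwapFree:  8388604 kB\n"
info = parse_meminfo(sample)
print("swap total: %d kB, free: %d kB" % (info["SwapTotal"], info["SwapFree"]))
```

A `SwapTotal` of 0 would mean the swap file or partition was created but never enabled.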

I thought I created an 8GB swap, but I still got this:
step 19900, training accuracy 0.12
Can you run this for around 5k or 10k steps and compare results?


Sorry for the late reply.
Just want to confirm: did you build TensorFlow for architecture 3.2?

Yes, I followed the instructions found at CUDA Musing: Building TensorFlow for Jetson TK1, and as previously reported, mnist from the TF examples works fine.

At this point, I would suggest that someone from NVIDIA actually build the TF wheel via these instructions, build Theano as well (since it would be helpful to test BOTH backends for comparison), and debug the mnist example that was posted here – seemingly mimicking the Keras commands. Question – does this example work on a.) ARM only and b.) other CUDA GPU systems? Again, I suggest running only through 10k iterations to make it easier to run on the TK1 with swap.

Any updates on this issue or getting Theano running on TK1?