TensorFlow program crashes:

Hi there,

I originally posted this here:
https://devtalk.nvidia.com/default/topic/1020336/jetson-tx1/r28-1-compiz-logout-error/2

Sorry for the duplicate.

My simple distributed TensorFlow program fails on a Jetson TX2 with JetPack 3.2 and this TensorFlow build:
https://nvidia.app.box.com/v/TF170-py27-wTRT

Here is code that “almost” always reproduces the issue:

import argparse
import numpy as np
import tensorflow as tf

worker_hosts = [
    "localhost:1111",
    "localhost:2222"
]
parser = argparse.ArgumentParser()
parser.add_argument(
    "--job_name",
    type=str,
    default="",
    help="Create a worker server"
)

parser.add_argument(
    "--task_index",
    type=int,
    default=0,
    help="Index of task within the workers"
)
args = parser.parse_args()
cluster_spec = tf.train.ClusterSpec({"worker": worker_hosts})
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.1)  # leave room for two processes on one GPU
server = tf.train.Server(cluster_spec,
                         job_name="worker",
                         task_index=args.task_index,
                         config=tf.ConfigProto(gpu_options=gpu_options))
if args.job_name == "":
    # "Main" process: build the graph and run it against the local server.
    size = 1024
    X = tf.placeholder(tf.float32, [None, size], name="X")
    W = tf.Variable(tf.ones((size, size)), name="W")
    # Pinning the matmul to the second worker is what triggers the crash.
    with tf.device("/job:worker/replica:0/task:1"):
        M = tf.matmul(X, W, name="XxY")
    B = tf.Variable(tf.ones((1, size)), name="B")
    R = tf.reduce_sum(M + B)
    sess = tf.Session(target='grpc://localhost:1111',
                      config=tf.ConfigProto(log_device_placement=True,
                                            gpu_options=gpu_options))
    sess.run(tf.global_variables_initializer())
    res = sess.run(R, feed_dict={X: np.ones((1, size))})
    print("Total: %d" % res)
    assert res == 1024 * 1025
else:
    server.join()
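
As a sanity check, the expected value in the assert can be reproduced with plain NumPy (same shapes as in the graph above):

import numpy as np

size = 1024
X = np.ones((1, size), dtype=np.float32)
W = np.ones((size, size), dtype=np.float32)
B = np.ones((1, size), dtype=np.float32)
# Each entry of X.dot(W) sums 1024 ones, i.e. 1024; adding B gives 1025
# per entry, and summing 1024 entries yields 1024 * 1025 = 1049600.
assert (X.dot(W) + B).sum() == 1024 * 1025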

First, run the “worker” process in one shell:

$ python test_jetson_tx2.py --job_name=worker --task_index=1

Then start the “main” process with:

$ python test_jetson_tx2.py

The “main” process then crashes. From dmesg:

$ dmesg

[17590.827885] CPU: 4 PID: 15747 Comm: python Not tainted 4.4.38-tegra #1
[17590.834416] Hardware name: quill (DT)
[17590.838083] task: ffffffc019351900 ti: ffffffc10cbf4000 task.ti: ffffffc10cbf4000
[17590.845566] PC is at 0x7faafbe2b0
[17590.848886] LR is at 0x7faafbe27c
[17590.852204] pc : [<0000007faafbe2b0>] lr : [<0000007faafbe27c>] pstate: 60000000
[17590.859596] sp : 0000007f70c040a0
[17590.862915] x29: 0000007f70c040a0 x28: 0000000000000001
[17590.868255] x27: 0000000000000005 x26: 0000000000000007
[17590.873590] x25: 0000007fb3562998 x24: 0000000000000002
[17590.878923] x23: 000000002dd2b5a7 x22: 0000007f70c04210
[17590.884262] x21: 0000007fb3559c10 x20: 0000007f70c04140
[17590.889597] x19: 0000007fb3562000 x18: 0000000000000014
[17590.894933] x17: 0000007fb4b2c760 x16: 0000007fb34ef250
[17590.900270] x15: 001dcd6500000000 x14: 0005ea2d20000000
[17590.905610] x13: 0000000000003680 x12: 0000000040360000
[17590.910956] x11: 0000000000000010 x10: 0101010101010101
[17590.916305] x9 : 0000007f340011a0 x8 : 0000007f70c04168
[17590.921643] x7 : 0000007fb3422d60 x6 : 000000001b873593
[17590.926984] x5 : 0000000000000073 x4 : 0000007f70c04120
[17590.932327] x3 : 0000000000000063 x2 : 000000000d84ddc0
[17590.937662] x1 : 0000000000000008 x0 : 00000000016e95ad

Many thanks in advance; any help would be greatly appreciated!

Hi,

Could you test your application without using threads?
Please share the results with us.

Thanks.

Hi AastaLLL,

Thanks for looking into this!
If we remove the second host in worker_hosts and pin the matrix multiplication to task 0, it works fine.
The code above also works fine on my laptop without any modifications at all.

Here is the modified code with one worker, for your convenience:

import argparse
import numpy as np
import tensorflow as tf

worker_hosts = [
    "localhost:1111"
    #"localhost:2222"
]
parser = argparse.ArgumentParser()
parser.add_argument(
    "--job_name",
    type=str,
    default="",
    help="Create a worker server"
)

parser.add_argument(
    "--task_index",
    type=int,
    default=0,
    help="Index of task within the workers"
)
args = parser.parse_args()
cluster_spec = tf.train.ClusterSpec({"worker": worker_hosts})
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.1)  # leave room for two processes on one GPU
server = tf.train.Server(cluster_spec,
                         job_name="worker",
                         task_index=args.task_index,
                         config=tf.ConfigProto(gpu_options=gpu_options))
if args.job_name == "":
    # "Main" process: build the graph and run it against the local server.
    size = 1024
    X = tf.placeholder(tf.float32, [None, size], name="X")
    W = tf.Variable(tf.ones((size, size)), name="W")
    # With a single worker, pinning to task 0 works fine.
    with tf.device("/job:worker/replica:0/task:0"):
        M = tf.matmul(X, W, name="XxY")
    B = tf.Variable(tf.ones((1, size)), name="B")
    R = tf.reduce_sum(M + B)
    sess = tf.Session(target='grpc://localhost:1111',
                      config=tf.ConfigProto(log_device_placement=True,
                                            gpu_options=gpu_options))
    sess.run(tf.global_variables_initializer())
    res = sess.run(R, feed_dict={X: np.ones((1, size))})
    print("Total: %d" % res)
    assert res == 1024 * 1025
else:
    server.join()

Hi,

TensorFlow has an API for multi-threading.
Could you follow that guide to rewrite your program?
https://www.tensorflow.org/versions/master/api_guides/python/threading_and_queues
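
A minimal sketch of the Coordinator pattern from that guide (the loop body, stop condition, and thread count below are illustrative only):

import threading
import tensorflow as tf

def worker_loop(coord, worker_id):
    # Run until any thread (or the main thread) requests a stop.
    step = 0
    while not coord.should_stop():
        step += 1
        if step >= 100:  # illustrative stop condition
            coord.request_stop()

coord = tf.train.Coordinator()
threads = [threading.Thread(target=worker_loop, args=(coord, i))
           for i in range(4)]
for t in threads:
    t.start()
coord.join(threads)  # block until all threads have stopped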

Thanks.

Hello,

My intention is to distribute training among three Jetsons. I am trying to follow this tutorial:
https://www.tensorflow.org/deploy/distributed
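
The pattern from that page is essentially between-graph replication. A rough sketch of what I am doing (the host names below are placeholders for my three boards):

import tensorflow as tf

# Placeholder host names for my three Jetsons.
cluster = tf.train.ClusterSpec({
    "ps":     ["jetson0:2222"],
    "worker": ["jetson1:2222", "jetson2:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps task and
# everything else on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(tf.zeros([10]), name="w")
    loss = tf.reduce_sum(tf.square(w))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)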

The program crashes when run with at least two workers on the Jetsons. The crash is reproducible with the posted code.
Is there something wrong with this code?
Could you please elaborate on why you are asking me to rewrite it using threads?

Hi,

Running training on Jetson is not recommended, since it is designed for inference.
With the data transmission between devices, TensorFlow performance may not be good.

Could you tell us why you want to run training on Jetson?
Thanks.

Hi again,

I am experimenting with the three Jetsons I have, to see whether I can learn to optimally control my device in real time.
I do not think data transmission is a problem in my setup: each camera can process its image locally and send only the
“compressed” CNN output to the “master” node for final decision making. Similarly, only a few thousand variables
(e.g. 4,000 float32 values is about 16 KB) need to be sent to each camera during the gradient computation.

I am also happy with the performance of TensorFlow training my model on a single Jetson.

The only problem blocking me is this failure.
Could you please try to reproduce it?

Thank you for your help!

Hi,

Sorry, we do not have much experience with TensorFlow training on Jetson.
Let us give it a try and get back to you later.

Thanks.

Hi,

After removing the device constraint from tf.matmul, we can run your program successfully.

diff --git a/test_jetson_tx2.py b/test_jetson_tx2.py
index 9f4e068..2c81369 100644
--- a/test_jetson_tx2.py
+++ b/test_jetson_tx2.py
@@ -31,8 +31,8 @@ if args.job_name == "":
     size = 1024
     X = tf.placeholder(tf.float32, [None, size], name="X")
     W = tf.Variable(tf.ones((size, size)), name="W")
-    with tf.device("/job:worker/replica:0/task:1"):
-        M = tf.matmul(X, W, name="XxY")
+#   with tf.device("/job:worker/replica:0/task:1"):
+    M = tf.matmul(X, W, name="XxY")
     B = tf.Variable(tf.ones((1, size)), name="B")
     R = tf.reduce_sum(M + B)
     sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,

We guess the error is caused by an invalid device placement for Tegra.
Could you check it again?
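
If you do need the remote placement, one untested alternative is to pin the op to the remote task's CPU explicitly, so that no GPU kernel gets placed on the other Tegra process:

# Untested workaround: keep the remote placement but force the op
# onto the remote task's CPU instead of its GPU.
with tf.device("/job:worker/replica:0/task:1/device:CPU:0"):
    M = tf.matmul(X, W, name="XxY")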

Thanks.

It seems everything works fine with TF 1.8.

Thanks a lot for the help!