TX2 Tensorflow 1.10 Training error

kilichzf · August 25, 2018, 12:14pm

Hi,
I'm trying to train a MobileNet SSD on a Jetson TX2 with the TensorFlow Object Detection API.
Since my question is long, I will just insert my Stack Overflow link:

python - Tensorflow object detection api training error "TypeError: Input 'y' of 'Mul' Op has type float32 - Stack Overflow

Has anybody trained successfully so far?
Thanks

AastaLLL · August 27, 2018, 6:54am

Hi,

Please note that training on the Jetson is NOT recommended.
The TX2 is designed for inference and is not well suited to back-propagation.

As for your question, does your code run successfully in a desktop environment?
If so, could you tell us which tool you use to read inputs?

Thanks.

kilichzf · August 28, 2018, 11:58am

Hello

I'm using tf-1.10 and following this tutorial on the TensorFlow Object Detection API, but running it locally.

It works fine on my laptop (CPU), but it is too slow.

On the Jetson TX2 I get the following error on both Python 2.7 and Python 3.5.

I'm using the wheels provided here.

Traceback (most recent call last):
File “object_detection/model_main.py”, line 101, in <module>
tf.app.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py”, line 125, in run
_sys.exit(main(argv))
File “object_detection/model_main.py”, line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 455, in train_and_evaluate
return executor.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 594, in run
return self.run_local()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 695, in run_local
saving_listeners=saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1179, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1209, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1167, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/home/nvidia/tensorflow/models/research/object_detection/model_lib.py”, line 287, in model_fn
prediction_dict, features[fields.InputDataFields.true_image_shape])
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 686, in loss
keypoints, weights)
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 859, in _assign_targets
groundtruth_weights_list)
File “/home/nvidia/tensorflow/models/research/object_detection/core/target_assigner.py”, line 481, in batch_assign_targets
anchors, gt_boxes, gt_class_targets, unmatched_class_label, gt_weights)
File “/home/nvidia/tensorflow/models/research/object_detection/core/target_assigner.py”, line 180, in assign
match = self._matcher.match(match_quality_matrix, **params)
File “/home/nvidia/tensorflow/models/research/object_detection/core/matcher.py”, line 239, in match
return Match(self._match(similarity_matrix, **params),
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 190, in _match
_match_when_rows_are_non_empty, _match_when_rows_are_empty)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py”, line 488, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py”, line 2074, in cond
orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py”, line 1920, in BuildCondBranch
original_result = fn()
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 153, in _match_when_rows_are_non_empty
-1)
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 203, in _set_values_using_indicator
indicator = tf.cast(1-indicator, x.dtype)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py”, line 878, in r_binary_op_wrapper
x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name=“x”)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1028, in convert_to_tensor
as_ref=False)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1124, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py”, line 228, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py”, line 207, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py”, line 442, in make_tensor_proto
_AssertCompatible(values, dtype)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py”, line 353, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected bool, got 1 of type ‘int’ instead.
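
With a traceback this long, it can help to find the last frame that is still inside the Object Detection API before the call disappears into TensorFlow internals. Here that frame is `argmax_matcher.py`, line 203, where `1-indicator` is evaluated and TensorFlow rejects the integer `1` because it expected a bool. The helper below is only a sketch for reading such output: `innermost_frame` is a hypothetical name, and it assumes the standard CPython frame format with straight double quotes, as printed in a real terminal.

```python
import re

# Standard CPython frame line:  File "path", line N, in name
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def innermost_frame(traceback_text, path_fragment):
    """Return (path, line, func) for the last frame whose path contains
    path_fragment, or None if no frame matches."""
    match = None
    for candidate in FRAME_RE.finditer(traceback_text):
        if path_fragment in candidate.group("path"):
            match = candidate  # keep overwriting so the last match wins
    if match is None:
        return None
    return match.group("path"), int(match.group("line")), match.group("func")


# Shortened excerpt of the traceback above, for illustration only.
sample = '''Traceback (most recent call last):
  File "object_detection/model_main.py", line 101, in <module>
    tf.app.run()
  File "/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py", line 203, in _set_values_using_indicator
    indicator = tf.cast(1-indicator, x.dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 353, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected bool, got 1 of type 'int' instead.
'''

print(innermost_frame(sample, "object_detection/"))
```

Passing `"object_detection/"` as the fragment skips the TensorFlow frames under `dist-packages`, so the result points at the API code that triggered the type check.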

AastaLLL · August 31, 2018, 3:52am

Hi,

We just announced an official TensorFlow package for Jetson TX2.
Could you give it a try?
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/

Thanks.

kilichzf · September 1, 2018, 11:06am

OK, I have installed TF from the link you provided,
reinstalled the Object Detection API,
and used this fix to get around a protobuf compilation error:

https://github.com/tensorflow/models/issues/4047

Here is the new error, with the full output of my terminal:

Thanks

nvidia@tegra-ubuntu:~/tensorflow/models/research$ ./train.sh
/usr/lib/python2.7/dist-packages/matplotlib/__init__.py:1352: UserWarning: This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

warnings.warn(_use_error_msg)
WARNING:tensorflow:Estimator’s model_fn (<function model_fn at 0x7f4c371758>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/nvidia/tensorflow/models/research/object_detection/core/preprocessor.py:1205: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the axis argument instead
WARNING:root:Variable [BoxPredictor_0/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[273]], model variable shape: [[6]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_0/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 512, 273]], model variable shape: [[1, 1, 512, 6]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_1/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_1/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 1024, 546]], model variable shape: [[1, 1, 1024, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_2/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_2/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 512, 546]], model variable shape: [[1, 1, 512, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_3/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_3/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_4/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_4/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_5/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_5/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 546]], model variable shape: [[1, 1, 128, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [global_step] is not available in checkpoint
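
The shape warnings above are what you would expect when fine-tuning on a different number of classes, and they are probably unrelated to the crash. In an SSD class predictor the output depth is the number of anchors per location times (number of classes + 1 background slot). The sketch below checks that arithmetic against the logged shapes. The anchor counts (3 for `BoxPredictor_0`, 6 for the others) and the class counts (90 in the checkpoint, 1 in the new model) are inferred from the numbers in the log, not stated anywhere in the thread, and the function names are hypothetical.

```python
def class_predictor_depth(anchors_per_location, num_classes):
    """Output depth of an SSD class predictor: one score per anchor for
    every class plus one background slot."""
    return anchors_per_location * (num_classes + 1)


def infer_num_classes(depth, anchors_per_location):
    """Invert class_predictor_depth; raise if the depth does not divide evenly."""
    slots, remainder = divmod(depth, anchors_per_location)
    if remainder:
        raise ValueError("depth %d is not a multiple of %d anchors"
                         % (depth, anchors_per_location))
    return slots - 1


# (layer, assumed anchors per location, checkpoint depth, model depth) from the log
logged = [
    ("BoxPredictor_0", 3, 273, 6),
    ("BoxPredictor_1", 6, 546, 12),
]

for name, anchors, ckpt_depth, model_depth in logged:
    print(name,
          "checkpoint classes:", infer_num_classes(ckpt_depth, anchors),
          "model classes:", infer_num_classes(model_depth, anchors))
```

Under these assumptions both layers give 90 classes for the checkpoint and 1 class for the new model, so only the class predictor weights are skipped and the rest of the checkpoint still loads.
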
Traceback (most recent call last):
File “object_detection/model_main.py”, line 101, in <module>
tf.app.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py”, line 125, in run
_sys.exit(main(argv))
File “object_detection/model_main.py”, line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 447, in train_and_evaluate
return executor.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 531, in run
return self.run_local()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 669, in run_local
hooks=train_hooks)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/home/nvidia/tensorflow/models/research/object_detection/model_lib.py”, line 287, in model_fn
prediction_dict, features[fields.InputDataFields.true_image_shape])
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 708, in loss
weights=batch_reg_weights)
File “/home/nvidia/tensorflow/models/research/object_detection/core/losses.py”, line 74, in __call__
return self._compute_loss(prediction_tensor, target_tensor, **params)
File “/home/nvidia/tensorflow/models/research/object_detection/core/losses.py”, line 157, in _compute_loss
reduction=tf.losses.Reduction.NONE
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py”, line 442, in huber_loss
math_ops.multiply(delta, linear))
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py”, line 203, in multiply
return gen_math_ops.mul(x, y, name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py”, line 4759, in mul
“Mul”, x=x, y=y, name=name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py”, line 546, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input ‘y’ of ‘Mul’ Op has type float32 that does not match type int32 of argument ‘x’.
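
The failing call is `math_ops.multiply(delta, linear)` inside `huber_loss`, and the message says that `x` (here `delta`) is int32 while `y` (`linear`) is float32. TensorFlow graph ops do not promote int32 to float32 implicitly, so an integer `delta` reaching this line is enough to raise the error. How `delta` became an integer in this setup is not established in the thread. The plain-Python sketch below only illustrates the Huber arithmetic with an explicit float cast on `delta`. It is not TensorFlow's implementation, and the function name and signature are chosen for illustration.

```python
def huber_loss(label, prediction, delta=1.0):
    """Huber loss for a single value: quadratic for small errors,
    linear for errors larger than delta."""
    delta = float(delta)  # explicit cast, so an integer delta (e.g. 1) is harmless
    abs_error = abs(float(prediction) - float(label))
    quadratic = min(abs_error, delta)   # part of the error inside the quadratic zone
    linear = abs_error - quadratic      # remainder, penalised linearly
    return 0.5 * quadratic * quadratic + delta * linear


print(huber_loss(0.0, 0.5))           # small error, purely quadratic
print(huber_loss(0.0, 3.0, delta=1))  # integer delta is cast before use
```

In a TensorFlow graph the equivalent guard would be to make sure `delta` is a float before it reaches the loss, for example by writing `1.0` rather than `1` wherever the value is configured.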

AastaLLL · September 10, 2018, 9:51am

Hi,

Thanks for testing.
We will try to reproduce this internally and share an update with you later.

AastaLLL · September 11, 2018, 2:28am

Hi,

We want to check this internally.
Could you share the steps/script to reproduce this?

Thanks.

kilichzf · September 11, 2018, 10:38pm

I'm following this guide, using the same dataset the author provides on his GitHub page.

I have already described how I installed the Object Detection API.
I used this guide to train locally, with exactly the same commands:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md
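
For anyone trying to reproduce this, the local training run described in that guide comes down to putting the `research` and `research/slim` directories on `PYTHONPATH` and calling `object_detection/model_main.py` with a pipeline config and a model directory. The sketch below assembles such a command in Python. The poster's `train.sh` is not shown in the thread, so the config path, model directory and step count are placeholders, and `build_training_command` is a hypothetical helper.

```python
import os


def build_training_command(research_dir, pipeline_config_path, model_dir, num_train_steps):
    """Return (argv, env) for launching model_main.py from research_dir."""
    env = dict(os.environ)
    extra_paths = [research_dir, os.path.join(research_dir, "slim")]
    existing = env.get("PYTHONPATH")
    # Append to any existing PYTHONPATH instead of overwriting it.
    env["PYTHONPATH"] = os.pathsep.join(([existing] if existing else []) + extra_paths)
    argv = [
        "python",
        "object_detection/model_main.py",
        "--pipeline_config_path=" + pipeline_config_path,
        "--model_dir=" + model_dir,
        "--num_train_steps=" + str(num_train_steps),
        "--alsologtostderr",
    ]
    return argv, env


# Placeholder values; substitute your own paths and step count.
argv, env = build_training_command(
    research_dir="/home/nvidia/tensorflow/models/research",
    pipeline_config_path="training/pipeline.config",
    model_dir="training/model_dir",
    num_train_steps=1000,
)
print(" ".join(argv))
# To actually launch training:
#   import subprocess
#   subprocess.call(argv, cwd="/home/nvidia/tensorflow/models/research", env=env)
```

Posting the exact command and config alongside the traceback usually makes it much easier for others to reproduce an error like this.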

I'm waiting for your response, because this problem may be caused by a mistake on my side.
Thank you for your interest.

AastaLLL · September 14, 2018, 6:01am

Hi,

Could you check whether this change fixes your issue:
https://gist.github.com/gSrikar/13e93b926d6105dc9de9e2bf2dd694c8

Thanks.

kilichzf · September 17, 2018, 9:28pm

Hi,
Sorry for the late response (I accidentally posted this in another thread).

Strangest thing :D

I gave up on training on the TX2 after a while and deleted everything.

When you answered my question, I reinstalled everything from scratch for Python 2.7 to test your fix.

No protobuf error!!
Ran training as root.
It works like a charm.

I guess I messed up some stuff in the previous install.

I'm sorry that I kind of wasted your time, though.

Thanks :)
PS:
The TX2 does not have enough RAM to provide a good training environment.
My friend's laptop (GTX 960M, 4 GB) runs training faster.
Just in case :)
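
On the RAM point: the TX2 has 8 GB of memory shared between the CPU and the GPU, so it can be worth checking how much is actually free before starting a training run. The sketch below parses text in the `/proc/meminfo` format and reports available memory in GiB. It assumes a Linux system that exposes the `MemAvailable` field and falls back to `MemFree` otherwise. The function names are hypothetical, and the sample values are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Field: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_gib(text):
    """Return available memory in GiB, preferring MemAvailable over MemFree."""
    info = parse_meminfo(text)
    kib = info.get("MemAvailable", info.get("MemFree", 0))
    return kib / (1024.0 * 1024.0)


# Made-up sample in the /proc/meminfo format.
sample = (
    "MemTotal:        8038296 kB\n"
    "MemFree:         1048576 kB\n"
    "MemAvailable:    2097152 kB\n"
)
print("%.1f GiB available" % available_gib(sample))
```

On the board itself you could pass `open("/proc/meminfo").read()` instead of `sample`, and reduce the batch size in the pipeline config if the number looks too small.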

AastaLLL · September 20, 2018, 6:32am

Hi,

It’s good to hear training works well on your side. : )
Thanks for your update.