TX2 TensorFlow 1.10 training error

Hi,
I'm trying to train a MobileNet SSD on a Jetson TX2 with the TensorFlow Object Detection API.
Since my question is long, I will just insert my Stack Overflow link:

https://stackoverflow.com/questions/51955693/tensorflow-object-detection-api-training-error-typeerror-input-y-of-mul-op/51956741#51956741

Has anybody trained successfully so far?
Thanks

Hi,

Please note that it's NOT recommended to run training jobs on the Jetson.
The TX2 is designed for inference and is not well suited to back-propagation.

As for your question, does your code run successfully in a desktop environment?
If yes, could you tell us which tool you use for reading inputs?

Thanks.

Hello,

I'm using TF 1.10 and following this tutorial on the TensorFlow Object Detection API, but running it locally:

https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9

It works fine on my laptop (CPU only), but it is too slow.

On the Jetson TX2 I get the following error with both Python 2.7 and Python 3.5.

I'm using the wheels provided here:

https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-10-wheel-with-jetpack-3-3/

Traceback (most recent call last):
File “object_detection/model_main.py”, line 101, in <module>
tf.app.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py”, line 125, in run
_sys.exit(main(argv))
File “object_detection/model_main.py”, line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 455, in train_and_evaluate
return executor.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 594, in run
return self.run_local()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 695, in run_local
saving_listeners=saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1179, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1209, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1167, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/home/nvidia/tensorflow/models/research/object_detection/model_lib.py”, line 287, in model_fn
prediction_dict, features[fields.InputDataFields.true_image_shape])
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 686, in loss
keypoints, weights)
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 859, in _assign_targets
groundtruth_weights_list)
File “/home/nvidia/tensorflow/models/research/object_detection/core/target_assigner.py”, line 481, in batch_assign_targets
anchors, gt_boxes, gt_class_targets, unmatched_class_label, gt_weights)
File “/home/nvidia/tensorflow/models/research/object_detection/core/target_assigner.py”, line 180, in assign
match = self._matcher.match(match_quality_matrix, **params)
File “/home/nvidia/tensorflow/models/research/object_detection/core/matcher.py”, line 239, in match
return Match(self._match(similarity_matrix, **params),
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 190, in _match
_match_when_rows_are_non_empty, _match_when_rows_are_empty)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py”, line 488, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py”, line 2074, in cond
orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py”, line 1920, in BuildCondBranch
original_result = fn()
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 153, in _match_when_rows_are_non_empty
-1)
File “/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py”, line 203, in _set_values_using_indicator
indicator = tf.cast(1-indicator, x.dtype)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py”, line 878, in r_binary_op_wrapper
x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name=“x”)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1028, in convert_to_tensor
as_ref=False)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1124, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py”, line 228, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py”, line 207, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py”, line 442, in make_tensor_proto
_AssertCompatible(values, dtype)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py”, line 353, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected bool, got 1 of type ‘int’ instead.
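
A traceback this long is easier to read once the library frames are filtered out. The sketch below is not part of the Object Detection API: `parse_frames` and `deepest_frame_outside` are hypothetical helper names, and treating any path that contains `dist-packages` as library code is an assumption that happens to fit the paths shown above. It assumes Python 3 and only locates the last frame outside the installed packages; it does not diagnose the error.

```python
import re

# Matches frame lines such as:  File "path/to/file.py", line 42, in func
# Straight and curly double quotes are both accepted, because forum software
# often converts the quotes in pasted tracebacks.
FRAME_RE = re.compile(
    r'File\s+["\u201c](?P<path>[^"\u201d]+)["\u201d],\s+line\s+(?P<line>\d+)'
    r'(?:,[ \t]+in[ \t]+(?P<func>\S+))?'
)


def parse_frames(traceback_text):
    """Return (path, line_number, function_name) for each frame line found."""
    return [
        (m.group("path"), int(m.group("line")), m.group("func") or "")
        for m in FRAME_RE.finditer(traceback_text)
    ]


def deepest_frame_outside(frames, library_marker="dist-packages"):
    """Return the last frame whose path lacks library_marker, or None."""
    for frame in reversed(frames):
        if library_marker not in frame[0]:
            return frame
    return None


# Three frames copied from the traceback above (shortened for illustration).
SAMPLE = '''Traceback (most recent call last):
File "object_detection/model_main.py", line 97, in main
File "/home/nvidia/tensorflow/models/research/object_detection/matchers/argmax_matcher.py", line 203, in _set_values_using_indicator
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 878, in r_binary_op_wrapper
TypeError: Expected bool, got 1 of type 'int' instead.
'''

if __name__ == "__main__":
    print(deepest_frame_outside(parse_frames(SAMPLE)))
```

For the traceback above, this should point at line 203 of argmax_matcher.py, the `1-indicator` expression, as the last line of Object Detection API code reached before TensorFlow raised the error.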

Hi,

We just announced an official TensorFlow package for Jetson TX2.
Could you give it a try?
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/

Thanks.

OK, I have installed TF from the link you provided,
reinstalled the Object Detection API,
and used this fix to get around the protobuf compilation error:

https://github.com/tensorflow/models/issues/4047

Here is the new error, with the full output of my terminal:

Thanks

nvidia@tegra-ubuntu:~/tensorflow/models/research$ ./train.sh
/usr/lib/python2.7/dist-packages/matplotlib/__init__.py:1352: UserWarning: This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

warnings.warn(_use_error_msg)
WARNING:tensorflow:Estimator’s model_fn (<function model_fn at 0x7f4c371758>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/nvidia/tensorflow/models/research/object_detection/core/preprocessor.py:1205: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the axis argument instead
WARNING:root:Variable [BoxPredictor_0/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[273]], model variable shape: [[6]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_0/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 512, 273]], model variable shape: [[1, 1, 512, 6]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_1/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_1/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 1024, 546]], model variable shape: [[1, 1, 1024, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_2/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_2/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 512, 546]], model variable shape: [[1, 1, 512, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_3/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_3/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_4/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_4/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 546]], model variable shape: [[1, 1, 256, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_5/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[546]], model variable shape: [[12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [BoxPredictor_5/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 546]], model variable shape: [[1, 1, 128, 12]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [global_step] is not available in checkpoint
Traceback (most recent call last):
File “object_detection/model_main.py”, line 101, in <module>
tf.app.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py”, line 125, in run
_sys.exit(main(argv))
File “object_detection/model_main.py”, line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 447, in train_and_evaluate
return executor.run()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 531, in run
return self.run_local()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py”, line 669, in run_local
hooks=train_hooks)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py”, line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File “/home/nvidia/tensorflow/models/research/object_detection/model_lib.py”, line 287, in model_fn
prediction_dict, features[fields.InputDataFields.true_image_shape])
File “/home/nvidia/tensorflow/models/research/object_detection/meta_architectures/ssd_meta_arch.py”, line 708, in loss
weights=batch_reg_weights)
File “/home/nvidia/tensorflow/models/research/object_detection/core/losses.py”, line 74, in __call__
return self._compute_loss(prediction_tensor, target_tensor, **params)
File “/home/nvidia/tensorflow/models/research/object_detection/core/losses.py”, line 157, in _compute_loss
reduction=tf.losses.Reduction.NONE
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py”, line 442, in huber_loss
math_ops.multiply(delta, linear))
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py”, line 203, in multiply
return gen_math_ops.mul(x, y, name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py”, line 4759, in mul
“Mul”, x=x, y=y, name=name)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py”, line 546, in _apply_op_helper
inferred_from[input_arg.type_attr]))
TypeError: Input ‘y’ of ‘Mul’ Op has type float32 that does not match type int32 of argument ‘x’.
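
The checkpoint warnings in the output above look alarming, but they appear to be unrelated to the TypeError. Here is a minimal sketch of the arithmetic behind them, assuming the usual SSD layout in which each class predictor emits one score per anchor per class, plus one background class. The anchor counts (3 for the first predictor, 6 for the others) are inferred from the logged shapes and are not stated in the thread; `class_predictor_channels` is a hypothetical helper name.

```python
def class_predictor_channels(anchors_per_location, num_classes):
    """Output channels of one SSD class predictor (background class included)."""
    return anchors_per_location * (num_classes + 1)


# Channel counts copied from the warnings for BoxPredictor_0 .. BoxPredictor_5.
CHECKPOINT_CHANNELS = [273, 546, 546, 546, 546, 546]
MODEL_CHANNELS = [6, 12, 12, 12, 12, 12]

# Assumption: anchors per location inferred from the shapes above.
ASSUMED_ANCHORS = [3, 6, 6, 6, 6, 6]

if __name__ == "__main__":
    for index, anchors in enumerate(ASSUMED_ANCHORS):
        # 90 classes in the checkpoint and 1 class in the new model are the
        # values that reproduce the logged shapes under the assumed anchors.
        from_checkpoint = class_predictor_channels(anchors, num_classes=90)
        from_model = class_predictor_channels(anchors, num_classes=1)
        print(index, from_checkpoint == CHECKPOINT_CHANNELS[index],
              from_model == MODEL_CHANNELS[index])
```

Under these assumptions the checkpoint was trained on 90 classes and the new model has a single class, which would explain why only the ClassPredictor variables are skipped when restoring.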

Hi,

Thanks for testing.
We will try to reproduce this internally and update you later.

Hi,

We want to check this internally.
Could you share the steps/script to reproduce this?

Thanks.

I'm following this guide, using the same dataset the author provides on his GitHub page.

I have already described how I installed the Object Detection API.
I used this guide to train locally with exactly the same commands.

I'm waiting for your response, because this problem may be caused by my own mistakes.
Thank you for your interest.

Hi,

Could you check if your issue can be fixed with this change:
https://gist.github.com/gSrikar/13e93b926d6105dc9de9e2bf2dd694c8

Thanks.

Hi,
Sorry for the late response (I accidentally posted this in another thread).

Strangest thing :D

I gave up on training on the TX2 after a while and deleted everything.

When you answered my question, I reinstalled everything from scratch for Python 2.7 to test your fix.

No protobuf error!!
Ran training as root
It works like a charm

I guess I messed something up in the previous install.

I'm sorry that I kind of wasted your time, though.

Thanks :)
PS:
The TX2 does not have enough RAM to provide a good training environment.
My friend's laptop (GTX 960M, 4 GB) runs training faster.
Just in case :)
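
Since a clean reinstall made the difference here, it can help to record which package versions an interpreter actually imports, before and after reinstalling. This is a minimal sketch, assuming the modules of interest expose a `__version__` attribute (`tensorflow` and `google.protobuf` do). `module_versions` is a hypothetical helper name and the module list is only an example.

```python
import importlib


def module_versions(names):
    """Map each module name to its __version__, or to a short failure note."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # A damaged install can fail with errors other than ImportError,
            # so any exception is recorded rather than raised.
            versions[name] = "import failed: {}".format(type(exc).__name__)
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Example module list; adjust it to the packages being compared.
    report = module_versions(["tensorflow", "google.protobuf"])
    for name in sorted(report):
        print("{}: {}".format(name, report[name]))
```

Running it with the same interpreter that launches training (for example the one used as root) keeps the report comparable between installs.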

Hi,

It’s good to hear training works well on your side. : )
Thanks for your update.