TF gives ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool

I’m using the NGC TensorFlow 19.05 py3 Docker image. I have cloned https://github.com/tensorflow/models and checked out branch r1.13.0.

Running the command below without TF_ENABLE_AUTO_MIXED_PRECISION works, but with it enabled like so:

TF_ENABLE_AUTO_MIXED_PRECISION=1 PYTHONPATH=. python official/resnet/cifar10_main.py --te 10 -ebe 10 --data_dir ../cifar10/cifar-10-batches-bin/

I get the following error:

Traceback (most recent call last):
  File "official/resnet/cifar10_main.py", line 278, in <module>
    absl_app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "official/resnet/cifar10_main.py", line 272, in main
    run_cifar(flags.FLAGS)
  File "official/resnet/cifar10_main.py", line 265, in run_cifar
    shape=[HEIGHT, WIDTH, NUM_CHANNELS])
  File "/edit/tf/models-1.13/official/resnet/resnet_run_loop.py", line 564, in resnet_main
    hooks=train_hooks, max_steps=flags_obj.max_train_steps)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/one_device_strategy.py", line 144, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "official/resnet/cifar10_main.py", line 230, in cifar10_model_fn
    fine_tune=params['fine_tune']
  File "/edit/tf/models-1.13/official/resnet/resnet_run_loop.py", line 410, in resnet_model_fn
    minimize_op = optimizer.apply_gradients(grad_vars, global_step)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 604, in apply_gradients
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2147, in cond
    (val_x.dtype.name, val_y.dtype.name))
ValueError: Outputs of true_fn and false_fn must have the same type: int64, bool

How can I fix this?

Thanks for reporting this bug. It is caused by passing a ResourceVariable as the global_step to tf.train.Optimizer.apply_gradients(). A fix will be available in the 19.07 release. If you would rather build from source now, change do_update() in tensorflow/python/training/optimizer.py so that it matches the following:

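    def do_update():
      update_op = self._apply_gradients_helper(grads_and_vars, global_step,
                                               name+'-apply')
      # Wrap anything that is not already a NoOp operation in a group op,
      # so that do_update() always returns a NoOp operation.
      if not isinstance(update_op, ops.Operation) or update_op.type != 'NoOp':
        update_op = control_flow_ops.group([update_op])
      return update_op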
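
For anyone wondering why the extra group() call helps, here is a toy model in plain Python. It is not TensorFlow code: the Tensor, Operation, group and cond names below are hypothetical stand-ins, and the assumption that the unpatched update yields an int64 tensor is only a reading of the error message, not something confirmed above. The sketch relies on the graph-mode behaviour that a cond branch returning an Operation is represented by a boolean tensor, while a branch returning a Tensor keeps its own dtype.

```python
# Toy model of the dtype check in cond(); NOT TensorFlow code.
# Every class and function here is a hypothetical stand-in.


class Tensor:
    """Stand-in for a graph tensor with a dtype name."""

    def __init__(self, dtype):
        self.dtype = dtype


class Operation:
    """Stand-in for a graph operation that has no output value."""

    def __init__(self, op_type):
        self.type = op_type


def group(inputs):
    """Stand-in for control_flow_ops.group: always yields a NoOp operation."""
    return Operation("NoOp")


def branch_dtype(value):
    """An Operation is proxied by a bool; a Tensor keeps its own dtype."""
    return "bool" if isinstance(value, Operation) else value.dtype


def cond(true_fn, false_fn):
    """Run both branches and insist that their output dtypes agree."""
    x, y = branch_dtype(true_fn()), branch_dtype(false_fn())
    if x != y:
        raise ValueError(
            "Outputs of true_fn and false_fn must have the same type: "
            "%s, %s" % (x, y))
    return x


def unpatched_update():
    # Assumption: the update hands back an int64 tensor (the new global_step).
    return Tensor("int64")


def patched_update():
    # Same check as in the patched do_update(): wrap non-NoOp results.
    update_op = unpatched_update()
    if not isinstance(update_op, Operation) or update_op.type != "NoOp":
        update_op = group([update_op])
    return update_op


def skip_update():
    # The branch that skips the step returns a plain NoOp.
    return Operation("NoOp")
```

In this model, cond(unpatched_update, skip_update) raises the same ValueError as in the traceback, while cond(patched_update, skip_update) passes because both branches now return NoOp operations.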