TLT Detectnet TrafficCamNet training not working

Hi, I am trying to fine-tune a resnet 18 model on a custom dataset. However, when I run the tlt-train command :
tlt-train detectnet_v2 -e /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/detectnet_v2_train_resnet18_kitti.txt -r /workspace/kitti_data/results -k tlt_encode

I am getting an error as follows:

==================================================================================================
2020-09-01 21:51:48.050217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-01 21:51:48.050258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-01 21:51:48.050283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-01 21:51:48.050302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-01 21:51:48.794194: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key block_1a_bn_1/beta/Adam not found in checkpoint
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True) [58/360]
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py”, line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 647, in create_session
init_fn=self._scaffold.init_fn)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 290, in prepare_session
config=config)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 204, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1306, in restore
err, “a Variable name or other graph key that is missing”)
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint.
Original error:

Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

==================================================================================================

Here is my training config file:

detectnet_v2_train_resnet18_kitti.txt (6.2 KB)

I do not understand what I am doing wrong. Can someone please help me figure this out??

Please create a new result folder(results_new) and try again.

-r /workspace/kitti_data/results_new

Thanks so much for the support. I am getting a new error now.
Just to add some more context, I am using nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

I do understand the issue, but I am unsure as to what to do about it.

Here is the tfrecord creation config I used.
conversion_spec_file_trainval.txt (230 Bytes)

I did comb through the docs. I only see an option to change nchw to nhwc in the tlt-encode command.

Below is the error I am seeing.

================================================================================
2020-09-02 15:01:12.714448: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero [68/327]
2020-09-02 15:01:12.715336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-02 15:01:12.716110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-02 15:01:12.716150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 15:01:12.716176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-02 15:01:12.716201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-02 15:01:46.127119: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 149, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 693, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Hi @seby1,
I did not understand your new error.
What is tlt-encode command?

Can you give more details about what command did you run, what is the spec and so on?
Thanks.

The error below is coming when I am running the following command:
tlt-train detectnet_v2 -e /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt -r /workspace/kitti_data/results_new/ -k tlt-encode

results_new is a newly created directory.

Here is the training config spec file: detectnet_v2_train_resnet18_kitti.txt (6.2 KB)

Just to add some more context, I am using nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

I am using the following GPU (Captured part of nvidia-smi)
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 38C P8 31W / 149W | 0MiB / 11441MiB | 0% Default |

I do understand the issue, but I am unsure as to what to do about it.

Here is the tfrecord creation config I used.
conversion_spec_file_trainval.txt (230 Bytes)

I did comb through the docs. I only see an option to change nchw to nhwc in the tlt-converter command.

Below is the error I am seeing. (CAPTURED THE FULL OUTPUT FOR MORE CONTEXT) (ISSUE HIGHLIGHTED IN BOLD)

================================================================================
Using TensorFlow backend.
2020-09-03 15:48:35.103288: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0


[[63314,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: 0abc4a9d1a0a

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.


2020-09-03 15:48:38.478499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-03 15:48:38.504253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.505140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:48:38.505192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:38.505285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:48:38.506753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:48:38.507477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:48:38.509528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:48:38.511034: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:48:38.511111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:48:38.511292: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.512191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.512990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:48:38.633671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-03 15:48:38.633725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-03 15:48:38.633746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-03 15:48:38,634 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt.
2020-09-03 15:48:38,636 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt
2020-09-03 15:48:40,122 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 6814 samples with a batch size of 24; each epoch will therefore take one extra step.


Layer (type) Output Shape Param # Connected to

==================================================================================================
input_1 (InputLayer) (None, 3, 544, 960) 0


conv1 (Conv2D) (None, 64, 272, 480) 9472 input_1[0][0]


bn_conv1 (BatchNormalization) (None, 64, 272, 480) 256 conv1[0][0]


activation_1 (Activation) (None, 64, 272, 480) 0 bn_conv1[0][0]


block_1a_conv_1 (Conv2D) (None, 64, 136, 240) 36928 activation_1[0][0]


block_1a_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_1[0][0]


block_1a_relu_1 (Activation) (None, 64, 136, 240) 0 block_1a_bn_1[0][0]


block_1a_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu_1[0][0]


block_1a_conv_shortcut (Conv2D) (None, 64, 136, 240) 4160 activation_1[0][0]


block_1a_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_2[0][0]


block_1a_bn_shortcut (BatchNorm (None, 64, 136, 240) 256 block_1a_conv_shortcut[0][0]


add_1 (Add) (None, 64, 136, 240) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]


block_1a_relu (Activation) (None, 64, 136, 240) 0 add_1[0][0]


block_1b_conv_1 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu[0][0]


block_1b_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_1[0][0]


block_1b_relu_1 (Activation) (None, 64, 136, 240) 0 block_1b_bn_1[0][0]


block_1b_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1b_relu_1[0][0]


block_1b_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_2[0][0]


add_2 (Add) (None, 64, 136, 240) 0 block_1b_bn_2[0][0]
block_1a_relu[0][0]


block_1b_relu (Activation) (None, 64, 136, 240) 0 add_2[0][0]


block_2a_conv_1 (Conv2D) (None, 128, 68, 120) 73856 block_1b_relu[0][0]


block_2a_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_1[0][0]


block_2a_relu_1 (Activation) (None, 128, 68, 120) 0 block_2a_bn_1[0][0]


block_2a_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu_1[0][0]


block_2a_conv_shortcut (Conv2D) (None, 128, 68, 120) 8320 block_1b_relu[0][0]


block_2a_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_2[0][0]


block_2a_bn_shortcut (BatchNorm (None, 128, 68, 120) 512 block_2a_conv_shortcut[0][0]


add_3 (Add) (None, 128, 68, 120) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]


block_2a_relu (Activation) (None, 128, 68, 120) 0 add_3[0][0]


block_2b_conv_1 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu[0][0]


block_2b_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_1[0][0]


block_2b_relu_1 (Activation) (None, 128, 68, 120) 0 block_2b_bn_1[0][0]


block_2b_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2b_relu_1[0][0]


block_2b_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_2[0][0]


add_4 (Add) (None, 128, 68, 120) 0 block_2b_bn_2[0][0]
block_2a_relu[0][0]


block_2b_relu (Activation) (None, 128, 68, 120) 0 add_4[0][0]


block_3a_conv_1 (Conv2D) (None, 256, 34, 60) 295168 block_2b_relu[0][0]


block_3a_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_1[0][0]


block_3a_relu_1 (Activation) (None, 256, 34, 60) 0 block_3a_bn_1[0][0]


block_3a_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu_1[0][0]


block_3a_conv_shortcut (Conv2D) (None, 256, 34, 60) 33024 block_2b_relu[0][0]


block_3a_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_2[0][0]


block_3a_bn_shortcut (BatchNorm (None, 256, 34, 60) 1024 block_3a_conv_shortcut[0][0]


add_5 (Add) (None, 256, 34, 60) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]


block_3a_relu (Activation) (None, 256, 34, 60) 0 add_5[0][0]


block_3b_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu[0][0]


block_3b_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_1[0][0]


block_3b_relu_1 (Activation) (None, 256, 34, 60) 0 block_3b_bn_1[0][0]


block_3b_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3b_relu_1[0][0]


block_3b_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_2[0][0]


add_6 (Add) (None, 256, 34, 60) 0 block_3b_bn_2[0][0]
block_3a_relu[0][0]


block_3b_relu (Activation) (None, 256, 34, 60) 0 add_6[0][0]


block_4a_conv_1 (Conv2D) (None, 512, 34, 60) 1180160 block_3b_relu[0][0]


block_4a_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_1[0][0]


block_4a_relu_1 (Activation) (None, 512, 34, 60) 0 block_4a_bn_1[0][0]


block_4a_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu_1[0][0]


block_4a_conv_shortcut (Conv2D) (None, 512, 34, 60) 131584 block_3b_relu[0][0]


block_4a_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_2[0][0]


block_4a_bn_shortcut (BatchNorm (None, 512, 34, 60) 2048 block_4a_conv_shortcut[0][0]


add_7 (Add) (None, 512, 34, 60) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]
__________________________________________________________________________________________________ [142/319]
block_4a_relu (Activation) (None, 512, 34, 60) 0 add_7[0][0]


block_4b_conv_1 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu[0][0]


block_4b_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_1[0][0]


block_4b_relu_1 (Activation) (None, 512, 34, 60) 0 block_4b_bn_1[0][0]


block_4b_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4b_relu_1[0][0]


block_4b_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_2[0][0]


add_8 (Add) (None, 512, 34, 60) 0 block_4b_bn_2[0][0]
block_4a_relu[0][0]


block_4b_relu (Activation) (None, 512, 34, 60) 0 add_8[0][0]


output_bbox (Conv2D) (None, 16, 34, 60) 8208 block_4b_relu[0][0]


output_cov (Conv2D) (None, 4, 34, 60) 2052 block_4b_relu[0][0]

==================================================================================================
Total params: 11,205,588
Trainable params: 11,195,860
Non-trainable params: 9,728


2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 6814, number of sources: 1, batch size per gpu: 24, steps: 284
2020-09-03 15:48:51,470 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-03 15:48:51.511818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.512669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:48:51.512726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:51.512815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:48:51.512923: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:48:51.512989: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:48:51.513056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:48:51.513122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:48:51.513181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:48:51.513312: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.514188: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.514956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:48:51,811 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-09-03 15:48:51,819 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-03 15:48:51,820 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-03 15:48:52.201543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:52,483 [INFO] iva.detectnet_v2.scripts.train: Found 6814 samples in training set
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 9875, number of sources: 1, batch size per gpu: 24, steps: 412
2020-09-03 15:48:55,959 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-03 15:48:56,283 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-09-03 15:48:56,291 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-03 15:48:56,291 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-03 15:48:56,742 [INFO] iva.detectnet_v2.scripts.train: Found 9875 samples in validation set [82/319]
2020-09-03 15:49:00.694744: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.695625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:49:00.695694: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:49:00.695784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:49:00.695867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:49:00.695929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:49:00.695988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:49:00.696053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:49:00.696105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:49:00.696244: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.697224: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.698011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:49:00.698051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-03 15:49:00.698078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-03 15:49:00.698103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-03 15:49:36.047342: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropInputOp only supports NHWC.
*** [[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]***
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 149, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 693, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call [22/319]
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
*** [[node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]***

Original stack trace for ‘gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 599, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 454, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/model/detectnet_model.py”, line 583, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/train_op_generator.py”, line 59, in get_train_op
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/train_op_generator.py”, line 74, in _get_train_op_without_cost
_scaling
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py”, line 419, in minimize
grad_loss=grad_loss)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/init.py”, line 253, in compute_gradients
gradients = self._optimizer.compute_gradients(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py”, line 537, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py”, line 158, in gradients
unconnected_gradients)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 703, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 362, in _MaybeCompile
return grad_fn() # Exit early
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 703, in
lambda: grad_fn(op, *out_grads))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_grad.py”, line 596, in _Conv2DGrad
data_format=data_format),
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py”, line 1407, in conv2d_backprop_input
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

…which was originally created as op ‘resnet18_nopool_bn_detectnet_v2/output_bbox/convolution’, defined at:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
[elided 6 identical lines from previous traceback]
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 454, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/model/detectnet_model.py”, line 557, in build_training_graph
File “/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py”, line 457, in call
output = self.call(inputs, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/network.py”, line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/network.py”, line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File “/usr/local/lib/python3.6/dist-packages/keras/layers/convolutional.py”, line 171, in call
dilation_rate=self.dilation_rate)
File “/opt/nvidia/third_party/keras/tensorflow_backend.py”, line 102, in conv2d
data_format=tf_data_format,
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 921, in convolution
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 1032, in convolution_internal
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py”, line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Is your dataset CHW or HWC format, rgb or bgr?
Could you please try to narrow down with a small part of your training dataset?

All of my images are in RGB and in the kitti-format as recommended by TLT for training a detectnet on a custom dataset.

I used tlt-dataset-convert -d conversion_spec_file_trainval.txt -o /workspace/kitti_data/train/train_tfrecords as instructed here

The conversion_spec_file_trainval.txt is : conversion_spec_file_trainval.txt (230 Bytes)

All the training images are saved in the RGB format (544,960,3) in kitti_data/train/images/.

I see that the tfrecords I create through this command only point to the image locations.
I however do not understand why the tlt-train command decides to load the images in the nchw format instead of nhwc.

Could you try to use public KITTI dataset to check firstly?

I figured out the issue. The dataset was being read in as NCHW instead of NHWC because the GPU on the AWS instance I was using was CUDA COMPUTE < 5.2. This lead the training script to ignore the GPU and try to use the CPU instead. Shifting to a beefier GPU and creating the results directory new everytime solved the issue.