TLT Detectnet TrafficCamNet training not working

seby1 · September 1, 2020, 10:01pm

Hi, I am trying to fine-tune a resnet 18 model on a custom dataset. However, when I run the tlt-train command :
tlt-train detectnet_v2 -e /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/detectnet_v2_train_resnet18_kitti.txt -r /workspace/kitti_data/results -k tlt_encode

I am getting an error as follows:

==================================================================================================
2020-09-01 21:51:48.050217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-01 21:51:48.050258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-01 21:51:48.050283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-01 21:51:48.050302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-01 21:51:48.794194: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key block_1a_bn_1/beta/Adam not found in checkpoint
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True) [58/360]
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py”, line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 647, in create_session
init_fn=self._scaffold.init_fn)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 290, in prepare_session
config=config)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 204, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1306, in restore
err, “a Variable name or other graph key that is missing”)
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint.
Original error:

Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

==================================================================================================

Here is my training config file:

detectnet_v2_train_resnet18_kitti.txt (6.2 KB)

I do not understand what I am doing wrong. Can someone please help me figure this out??

Morganh · September 2, 2020, 3:11am

Please create a new result folder(results_new) and try again.

-r /workspace/kitti_data/results_new

seby1 · September 2, 2020, 3:08pm

Thanks so much for the support. I am getting a new error now.
Just to add some more context, I am using nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

I do understand the issue, but I am unsure as to what to do about it.

Here is the tfrecord creation config I used.
conversion_spec_file_trainval.txt (230 Bytes)

I did comb through the docs. I only see an option to change nchw to nhwc in the tlt-encode command.

Below is the error I am seeing.

================================================================================
2020-09-02 15:01:12.714448: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero [68/327]
2020-09-02 15:01:12.715336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-02 15:01:12.716110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-02 15:01:12.716150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-02 15:01:12.716176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-02 15:01:12.716201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-02 15:01:46.127119: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 149, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 693, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Morganh · September 3, 2020, 7:53am

Hi @seby1,
I did not understand your new error.
What is tlt-encode command?

Can you give more details about what command did you run, what is the spec and so on?
Thanks.

seby1 · September 3, 2020, 3:52pm

The error below is coming when I am running the following command:
tlt-train detectnet_v2 -e /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt -r /workspace/kitti_data/results_new/ -k tlt-encode

results_new is a newly created directory.

Here is the training config spec file: detectnet_v2_train_resnet18_kitti.txt (6.2 KB)

Just to add some more context, I am using nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

I do understand the issue, but I am unsure as to what to do about it.

Here is the tfrecord creation config I used.
conversion_spec_file_trainval.txt (230 Bytes)

I did comb through the docs. I only see an option to change nchw to nhwc in the tlt-converter command.

Below is the error I am seeing. (CAPTURED THE FULL OUTPUT FOR MORE CONTEXT) (ISSUE HIGHLIGHTED IN BOLD)

================================================================================
Using TensorFlow backend.
2020-09-03 15:48:35.103288: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[[63314,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: 0abc4a9d1a0a

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-09-03 15:48:38.478499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-03 15:48:38.504253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.505140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:48:38.505192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:38.505285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:48:38.506753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:48:38.507477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:48:38.509528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:48:38.511034: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:48:38.511111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:48:38.511292: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.512191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:38.512990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:48:38.633671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-03 15:48:38.633725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-03 15:48:38.633746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-03 15:48:38,634 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt.
2020-09-03 15:48:38,636 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/spec_file.txt
2020-09-03 15:48:40,122 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 6814 samples with a batch size of 24; each epoch will therefore take one extra step.

Layer (type) Output Shape Param # Connected to

==================================================================================================
input_1 (InputLayer) (None, 3, 544, 960) 0

conv1 (Conv2D) (None, 64, 272, 480) 9472 input_1[0][0]

bn_conv1 (BatchNormalization) (None, 64, 272, 480) 256 conv1[0][0]

activation_1 (Activation) (None, 64, 272, 480) 0 bn_conv1[0][0]

block_1a_conv_1 (Conv2D) (None, 64, 136, 240) 36928 activation_1[0][0]

block_1a_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_1[0][0]

block_1a_relu_1 (Activation) (None, 64, 136, 240) 0 block_1a_bn_1[0][0]

block_1a_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu_1[0][0]

block_1a_conv_shortcut (Conv2D) (None, 64, 136, 240) 4160 activation_1[0][0]

block_1a_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_2[0][0]

block_1a_bn_shortcut (BatchNorm (None, 64, 136, 240) 256 block_1a_conv_shortcut[0][0]

add_1 (Add) (None, 64, 136, 240) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]

block_1a_relu (Activation) (None, 64, 136, 240) 0 add_1[0][0]

block_1b_conv_1 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu[0][0]

block_1b_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_1[0][0]

block_1b_relu_1 (Activation) (None, 64, 136, 240) 0 block_1b_bn_1[0][0]

block_1b_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1b_relu_1[0][0]

block_1b_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_2[0][0]

add_2 (Add) (None, 64, 136, 240) 0 block_1b_bn_2[0][0]
block_1a_relu[0][0]

block_1b_relu (Activation) (None, 64, 136, 240) 0 add_2[0][0]

block_2a_conv_1 (Conv2D) (None, 128, 68, 120) 73856 block_1b_relu[0][0]

block_2a_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_1[0][0]

block_2a_relu_1 (Activation) (None, 128, 68, 120) 0 block_2a_bn_1[0][0]

block_2a_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu_1[0][0]

block_2a_conv_shortcut (Conv2D) (None, 128, 68, 120) 8320 block_1b_relu[0][0]

block_2a_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_2[0][0]

block_2a_bn_shortcut (BatchNorm (None, 128, 68, 120) 512 block_2a_conv_shortcut[0][0]

add_3 (Add) (None, 128, 68, 120) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]

block_2a_relu (Activation) (None, 128, 68, 120) 0 add_3[0][0]

block_2b_conv_1 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu[0][0]

block_2b_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_1[0][0]

block_2b_relu_1 (Activation) (None, 128, 68, 120) 0 block_2b_bn_1[0][0]

block_2b_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2b_relu_1[0][0]

block_2b_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_2[0][0]

add_4 (Add) (None, 128, 68, 120) 0 block_2b_bn_2[0][0]
block_2a_relu[0][0]

block_2b_relu (Activation) (None, 128, 68, 120) 0 add_4[0][0]

block_3a_conv_1 (Conv2D) (None, 256, 34, 60) 295168 block_2b_relu[0][0]

block_3a_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_1[0][0]

block_3a_relu_1 (Activation) (None, 256, 34, 60) 0 block_3a_bn_1[0][0]

block_3a_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu_1[0][0]

block_3a_conv_shortcut (Conv2D) (None, 256, 34, 60) 33024 block_2b_relu[0][0]

block_3a_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_2[0][0]

block_3a_bn_shortcut (BatchNorm (None, 256, 34, 60) 1024 block_3a_conv_shortcut[0][0]

add_5 (Add) (None, 256, 34, 60) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]

block_3a_relu (Activation) (None, 256, 34, 60) 0 add_5[0][0]

block_3b_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu[0][0]

block_3b_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_1[0][0]

block_3b_relu_1 (Activation) (None, 256, 34, 60) 0 block_3b_bn_1[0][0]

block_3b_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3b_relu_1[0][0]

block_3b_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_2[0][0]

add_6 (Add) (None, 256, 34, 60) 0 block_3b_bn_2[0][0]
block_3a_relu[0][0]

block_3b_relu (Activation) (None, 256, 34, 60) 0 add_6[0][0]

block_4a_conv_1 (Conv2D) (None, 512, 34, 60) 1180160 block_3b_relu[0][0]

block_4a_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_1[0][0]

block_4a_relu_1 (Activation) (None, 512, 34, 60) 0 block_4a_bn_1[0][0]

block_4a_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu_1[0][0]

block_4a_conv_shortcut (Conv2D) (None, 512, 34, 60) 131584 block_3b_relu[0][0]

block_4a_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_2[0][0]

block_4a_bn_shortcut (BatchNorm (None, 512, 34, 60) 2048 block_4a_conv_shortcut[0][0]

add_7 (Add) (None, 512, 34, 60) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]
__________________________________________________________________________________________________ [142/319]
block_4a_relu (Activation) (None, 512, 34, 60) 0 add_7[0][0]

block_4b_conv_1 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu[0][0]

block_4b_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_1[0][0]

block_4b_relu_1 (Activation) (None, 512, 34, 60) 0 block_4b_bn_1[0][0]

block_4b_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4b_relu_1[0][0]

block_4b_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_2[0][0]

add_8 (Add) (None, 512, 34, 60) 0 block_4b_bn_2[0][0]
block_4a_relu[0][0]

block_4b_relu (Activation) (None, 512, 34, 60) 0 add_8[0][0]

output_bbox (Conv2D) (None, 16, 34, 60) 8208 block_4b_relu[0][0]

output_cov (Conv2D) (None, 4, 34, 60) 2052 block_4b_relu[0][0]

==================================================================================================
Total params: 11,205,588
Trainable params: 11,195,860
Non-trainable params: 9,728

2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-03 15:48:51,341 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 6814, number of sources: 1, batch size per gpu: 24, steps: 284
2020-09-03 15:48:51,470 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-03 15:48:51.511818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.512669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:48:51.512726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:51.512815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:48:51.512923: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:48:51.512989: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:48:51.513056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:48:51.513122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:48:51.513181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:48:51.513312: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.514188: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:48:51.514956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:48:51,811 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-09-03 15:48:51,819 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-03 15:48:51,820 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-03 15:48:52.201543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:48:52,483 [INFO] iva.detectnet_v2.scripts.train: Found 6814 samples in training set
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2020-09-03 15:48:55,919 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 9875, number of sources: 1, batch size per gpu: 24, steps: 412
2020-09-03 15:48:55,959 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-09-03 15:48:56,283 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-09-03 15:48:56,291 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-09-03 15:48:56,291 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-09-03 15:48:56,742 [INFO] iva.detectnet_v2.scripts.train: Found 9875 samples in validation set [82/319]
2020-09-03 15:49:00.694744: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.695625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-09-03 15:49:00.695694: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-03 15:49:00.695784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-03 15:49:00.695867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-03 15:49:00.695929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-03 15:49:00.695988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-03 15:49:00.696053: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-03 15:49:00.696105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-03 15:49:00.696244: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.697224: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-03 15:49:00.698011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-03 15:49:00.698051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-03 15:49:00.698078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-03 15:49:00.698103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-03 15:49:36.047342: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropInputOp only supports NHWC.
*** [[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]***
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.[[{{node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 149, in run_training_loop
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run
raise six.reraise(original_exc_info)
File “/usr/local/lib/python3.6/dist-packages/six.py”, line 693, in reraise
raise value
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run
return self._sess.run(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run
run_metadata=run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run
return self._sess.run(args, kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call [22/319]
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
*** [[node gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]***

Original stack trace for ‘gradients/resnet18_nopool_bn_detectnet_v2/output_bbox/convolution_grad/Conv2DBackpropInput’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 599, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 454, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/model/detectnet_model.py”, line 583, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/train_op_generator.py”, line 59, in get_train_op
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/train_op_generator.py”, line 74, in _get_train_op_without_cost
_scaling
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py”, line 419, in minimize
grad_loss=grad_loss)
File “/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/init.py”, line 253, in compute_gradients
gradients = self._optimizer.compute_gradients(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py”, line 537, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py”, line 158, in gradients
unconnected_gradients)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 703, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 362, in _MaybeCompile
return grad_fn() # Exit early
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py”, line 703, in
lambda: grad_fn(op, *out_grads))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_grad.py”, line 596, in _Conv2DGrad
data_format=data_format),
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py”, line 1407, in conv2d_backprop_input
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

…which was originally created as op ‘resnet18_nopool_bn_detectnet_v2/output_bbox/convolution’, defined at:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
[elided 6 identical lines from previous traceback]
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 454, in build_training_graph
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/model/detectnet_model.py”, line 557, in build_training_graph
File “/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py”, line 457, in call
output = self.call(inputs, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/network.py”, line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/network.py”, line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File “/usr/local/lib/python3.6/dist-packages/keras/layers/convolutional.py”, line 171, in call
dilation_rate=self.dilation_rate)
File “/opt/nvidia/third_party/keras/tensorflow_backend.py”, line 102, in conv2d
data_format=tf_data_format,
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 921, in convolution
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 1032, in convolution_internal
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py”, line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()

Morganh · September 6, 2020, 3:08pm

Is your dataset CHW or HWC format, rgb or bgr?
Could you please try to narrow down with a small part of your training dataset?

seby1 · September 9, 2020, 5:42pm

All of my images are in RGB and in the kitti-format as recommended by TLT for training a detectnet on a custom dataset.

I used tlt-dataset-convert -d conversion_spec_file_trainval.txt -o /workspace/kitti_data/train/train_tfrecords as instructed here

The conversion_spec_file_trainval.txt is : conversion_spec_file_trainval.txt (230 Bytes)

All the training images are saved in the RGB format (544,960,3) in kitti_data/train/images/.

I see that the tfrecords I create through this command only point to the image locations.
I however do not understand why the tlt-train command decides to load the images in the nchw format instead of nhwc.

Morganh · September 10, 2020, 2:57am

Could you try to use public KITTI dataset to check firstly?

seby1 · September 10, 2020, 2:42pm

I figured out the issue. The dataset was being read in as NCHW instead of NHWC because the GPU on the AWS instance I was using was CUDA COMPUTE < 5.2. This lead the training script to ignore the GPU and try to use the CPU instead. Shifting to a beefier GPU and creating the results directory new everytime solved the issue.

abhigoku10 · July 2, 2021, 10:37am

How to change the NCHW to NHWC which file should i modify ??

Topic		Replies	Views
Error wile using TLT pretrained model tlt_semantic_segmentation:resnet101 TAO Toolkit	7	604	August 27, 2021
Failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error TAO Toolkit	5	9968	October 12, 2021
Tlt3.0 retrain trafficcamnet getting the error when do the evaluation TAO Toolkit	31	1638	October 12, 2021
ValueError: No dataset tfrecords file found at path TAO Toolkit	10	1674	October 12, 2021
TLT DetectnetV2, Problem (Solved! -> RTX 3070 not supported by tlt 2.0_py3) TAO Toolkit	12	673	October 12, 2021
Run detectnet_v2.ipynb error with my own data TAO Toolkit tao	23	1409	March 4, 2022
Tensor reshape error when evaluating TrafficCamNet TAO Toolkit tensorflow	14	1033	August 20, 2023
Tlt unet evaluate failed TAO Toolkit	10	514	September 18, 2021
Out Of Memory Error While Training Peoplenet Model TAO Toolkit	2	448	March 8, 2022
Training with TLT a detectnet_v2 resnet18 pre-trained model failed TAO Toolkit	2	615	October 12, 2021

TLT Detectnet TrafficCamNet training not working

Here is my training config file:

Related topics