Hi, I am trying to fine-tune a resnet 18 model on a custom dataset. However, when I run the tlt-train command :
tlt-train detectnet_v2 -e /workspace/nvidia_tlt_models/tlt_pretrained_detectnet_v2_vresnet18/detectnet_v2_train_resnet18_kitti.txt -r /workspace/kitti_data/results -k tlt_encode
I am getting an error as follows:
==================================================================================================
2020-09-01 21:51:48.050217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
2020-09-01 21:51:48.050258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-01 21:51:48.050283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-01 21:51:48.050302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-01 21:51:48.794194: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key block_1a_bn_1/beta/Adam not found in checkpoint
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[{{node save/RestoreV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True) [58/360]
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py”, line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 647, in create_session
init_fn=self._scaffold.init_fn)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 290, in prepare_session
config=config)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py”, line 204, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 1306, in restore
err, “a Variable name or other graph key that is missing”)
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint.
Original error:
Key block_1a_bn_1/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for ‘save/RestoreV2’:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 624, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 147, in run_training_loop
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1104, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 727, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 638, in create_session
self._scaffold.finalize()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 828, in init
self.build()
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 878, in _build
build_restore=build_restore)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 502, in _build_internal
restore_sequentially, reshape)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 381, in _AddShardedRestoreOps
name=“restore_shard”))
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1748, in init
self._traceback = tf_stack.extract_stack()
==================================================================================================
Here is my training config file:
detectnet_v2_train_resnet18_kitti.txt (6.2 KB)
I do not understand what I am doing wrong. Can someone please help me figure this out??