Facing error after training command

Docker tag → v3.21.08-py3
Network Type → detectnet_v2
Training spec →
training_spec.txt (3.3 KB)

Hi, I am facing an error after running the training command:

training command →

tao detectnet_v2 train -k tlt_encode -r /workspace/tao-experiments/results -e /workspace/tao-experiments/specs/training_spec.txt --gpu_index 1

error →

2022-02-25 16:14:53,358 [ERROR] tensorflow: ==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'IsVariableInitialized_308:0' shape=() dtype=bool>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/hooks/hooks.py", line 285, in begin
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/tf_should_use.py", line 198, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))

2022-02-25 21:44:54,373 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Did you ever run the detectnet_v2 Jupyter notebook? Was it successful?

No, I have not run the Jupyter notebook so far. I am just trying to train using the training command below:

training command →
tao detectnet_v2 train -k tlt_encode -r /workspace/tao-experiments/results -e /workspace/tao-experiments/specs/spec.txt --gpu_index 1

Then I got the error shown in the previous comment.

Could you upload the full log as a file?
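
If it is easier, you can capture everything to a file and attach that; a minimal sketch using standard shell redirection (the log file name is just illustrative):

tao detectnet_v2 train -k tlt_encode -r /workspace/tao-experiments/results -e /workspace/tao-experiments/specs/training_spec.txt --gpu_index 1 2>&1 | tee train_full.log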

Hi, I think I have the same error here.

train_error.log (402.8 KB)

Maybe it is similar to Troubleshooting Guide — TAO Toolkit 3.22.05 documentation

Please try to train with a new result folder.
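
For example, keep everything else the same and only point -r at an empty directory (the results_new path below is just illustrative; it must sit under a directory mapped in your ~/.tao_mounts.json):

tao detectnet_v2 train -k tlt_encode -r /workspace/tao-experiments/results_new -e /workspace/tao-experiments/specs/training_spec.txt --gpu_index 1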

The link you sent does not seem to help, but I made some changes:
I created a new venv for the Jupyter notebook (this time with Python 3.6 instead of 3.8) and deleted what I understand to be the results folder ($USER_EXPERIMENT_DIR/experiment_dir_unpruned).
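
Roughly, that setup looks like this (a sketch only, assuming the launcher is installed from the nvidia-tao pip package; exact versions may differ):

python3.6 -m venv tao-venv
source tao-venv/bin/activate
pip install nvidia-tao jupyter
rm -rf $USER_EXPERIMENT_DIR/experiment_dir_unpruned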

Now I don't see errors, but the training stops without actually training at all.

train_error-p36-new_folder.log (46.6 KB)

"Illegal instruction (core dumped) "

The above error comes from an old CPU. You can search the forum and find similar topics.

OK, I see the topics about old CPUs. I'll try another host. Thanks.

Just one question: does this mean that TAO is running the dockers 'outside' the NVIDIA container?
I thought all this stuff was running on the GPU. (By the way, this host has a GTX 970.)

No, TAO is running with the TAO dockers: TAO Toolkit for Computer Vision | NVIDIA NGC
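
If you want to verify that, watch the containers while a command is running; the image name in the comment below is only what the v3.21.08-py3 launcher typically pulls, so treat it as an example:

docker ps --format '{{.Image}}'
# expect something like nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 while the train command runs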

For "Illegal instruction (core dumped) ", the reason is as below.
Old CPUs were missing AVX2 instruction set.
See Core dumped on examples - #3 by Morganh
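
A quick way to check whether the host CPU supports AVX2 (empty output means the CPU lacks AVX2):

# prints "avx2" if the flag is present in /proc/cpuinfo, nothing otherwise
grep -o avx2 /proc/cpuinfo | uniq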
