Error in detectnet_v2 - 10. Model Export

Hello -

I am running a virtual machine to train my models and run jupyter-notebook. Going through TLT - CV Training, I trained my model, pruned it, and retrained from the pruned model. After going through steps 1 - 10, I shut my virtual machine down, then came back later to run 10.A Int8 Optimization.

For step 10. Model Export (before shutting my machine down) I was receiving the following error:

!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_final
# Removing a pre-existing copy of the etlt if there has been any.
import os
output_file=os.path.join(os.environ['LOCAL_EXPERIMENT_DIR'],
                         "experiment_dir_final/resnet18_detector.etlt")
if os.path.exists(output_file):
    os.system("rm {}".format(output_file))
!tlt detectnet_v2 export \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt \
                  -k $KEY


2021-04-20 13:01:23,001 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/export.py", line 12, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 198, in launch_export
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 155, in run_export
AssertionError: Default output file /workspace/tlt-experiments/detectnet_v2/experiment_dir_final/resnet18_detector.etlt already exists
Traceback (most recent call last):
  File "/usr/local/bin/detectnet_v2", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-20 13:01:34,267 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

After shutting the machine down, restarting it, and running the same command, I am now seeing this error for the same 10. Model Export commands:

!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_final
# Removing a pre-existing copy of the etlt if there has been any.
import os
output_file=os.path.join(os.environ['LOCAL_EXPERIMENT_DIR'],
                         "experiment_dir_final/resnet18_detector.etlt")
if os.path.exists(output_file):
    os.system("rm {}".format(output_file))
!tlt detectnet_v2 export \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt \
                  -k $KEY


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-eab212de10ab> in <module>
      2 # Removing a pre-existing copy of the etlt if there has been any.
      3 import os
----> 4 output_file=os.path.join(os.environ['LOCAL_EXPERIMENT_DIR'],
      5                          "experiment_dir_final/resnet18_detector.etlt")
      6 if os.path.exists(output_file):

/usr/lib/python3.7/os.py in __getitem__(self, key)
    677         except KeyError:
    678             # raise KeyError with the original key value
--> 679             raise KeyError(key) from None
    680         return self.decodevalue(value)
    681 

KeyError: 'LOCAL_EXPERIMENT_DIR'

Any ideas as to what is going on and how to solve this issue?

Thanks,
Bryan

P.S. I am a noob at this, so any and all help is much appreciated. Getting to this point took a lot of effort and work, with lots of configuration to get iPython, Jupyter, etc. all working nicely together. I’d love to solve this and move on to putting the models on my Nano.

For your first error above, the log already gives the hint: the output file already exists. Please also follow the comment in the notebook cell, “# Removing a pre-existing copy of the etlt if there has been any”, and delete the existing .etlt file before exporting.
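A minimal sketch of that cleanup step in plain Python, in case it helps. It assumes the host-side directory in LOCAL_EXPERIMENT_DIR is the one mapped into the container through ~/.tlt_mounts.json; if the mapping differs, the file the exporter complains about will not be the one checked here. `remove_stale_export` is a hypothetical helper name, not part of TLT, and the "." fallback is only there so the snippet runs without the variable set.

```python
import os


def remove_stale_export(path):
    """Delete a previous export at `path`; return True if a file was removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False


# Hypothetical usage in the notebook; the "." fallback is for illustration only.
local_dir = os.environ.get("LOCAL_EXPERIMENT_DIR", ".")
output_file = os.path.join(local_dir, "experiment_dir_final", "resnet18_detector.etlt")

if remove_stale_export(output_file):
    print("Removed stale export:", output_file)
else:
    print("No previous export found at:", output_file)
```

Printing which branch was taken makes it visible whether the host path actually pointed at the existing file.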

For the second error, ‘LOCAL_EXPERIMENT_DIR’ is not defined. Please run the previous cell to set the environment variable.
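To make this kind of failure easier to spot, a small guard at the top of the cell can list which variables are missing before anything else runs. This is only a sketch: `missing_env_vars` is a hypothetical helper, and the variable names are the ones used in the export cell quoted above.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variables referenced by the export cell.
required = ["LOCAL_EXPERIMENT_DIR", "USER_EXPERIMENT_DIR", "KEY"]

missing = missing_env_vars(required)
if missing:
    print("Not set, re-run the cell that defines them:", ", ".join(missing))
```

Treating an empty value as missing is a deliberate choice here, since an empty variable tends to fail later in less obvious ways.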

For error 1, there is no documentation in the notebook regarding removing a pre-existing copy of the etlt file. I did, however, rename the directory (so I wouldn’t lose the data), but I still ran into an error when running this command again.

For the second error, the previous cell is 9. Visualize inferences, and before that 8. Evaluate the retrained model. I went back to Evaluate the retrained model and am now receiving this error:

detectnet_v2 evaluate: error: argument -k/--key: expected one argument
...... tlt.components.docker_handler.docker_handler: Stopping container.

Everything downstream from here errors out as well. So it appears to be this -k/--key argument, but I am a bit lost with this.

Please see above; the notebook mentions that. Please remove the .etlt file if it already exists.
As for the previous cell, sorry for the confusion. What I meant is that after rebooting your machine, your environment variables are lost, so you need to run section 0 of the notebook to set up the environment variables again.
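The -k/--key message reported earlier is consistent with the same cause: if $KEY is empty, the shell expands `-k $KEY` to a bare `-k` with no value after it. The sketch below reproduces that behaviour with Python's argparse. It is an illustration only, under the assumption that the TLT entrypoint parses its arguments in an argparse-like way; `build_parser` and `key_is_accepted` are hypothetical names and this is not the real TLT parser.

```python
import argparse
import contextlib
import io


def build_parser():
    """A stand-in parser with a single -k/--key option (not the real TLT parser)."""
    parser = argparse.ArgumentParser(prog="detectnet_v2 evaluate")
    parser.add_argument("-k", "--key", required=True)
    return parser


def key_is_accepted(argv):
    """Return True if `argv` parses cleanly, False if argparse rejects it."""
    try:
        # argparse prints its error to stderr and raises SystemExit on bad input.
        with contextlib.redirect_stderr(io.StringIO()):
            build_parser().parse_args(argv)
        return True
    except SystemExit:
        return False


print(key_is_accepted(["-k", "my_key"]))  # key present
print(key_is_accepted(["-k"]))            # what an empty, unquoted $KEY leaves behind
```

Once section 0 has been re-run and KEY has a value again, the first case applies and the downstream cells should get past argument parsing.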