Hello,
My problem concerns an error in the BodyPoseNet notebook (see these articles: 1 & 2).
- Hardware: GV100GL (Tesla V100, 32 GB)
- Network: BodyposeNet
- TAO Version (docker_tag): v3.21.08-py3
- Training spec file: I used the default spec file in the notebook
- How to reproduce the issue: Same as above; I ran the commands provided in the notebook. The problem starts at step 9.3 (INT8 Optimization), at this command:
import os  # needed if this cell is run standalone

output_file = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"), "models/exp_m1_final/bpnet_model.etlt")
# NOTE: If you are trying to re-run calibration, please remove the calibration table (cal_cache_file).
# If you are trying to re-generate calibration data, please remove cal_data_file as well.
if os.path.exists(output_file):
    os.system("rm {}".format(output_file))
!tao bpnet export \
-m $USER_EXPERIMENT_DIR/models/exp_m1_retrain/$RETRAIN_MODEL_CHECKPOINT \
-o $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.etlt \
-k $KEY \
-d $IN_HEIGHT,$IN_WIDTH,$IN_CHANNELS \
-e $SPECS_DIR/bpnet_retrain_m1_coco.yaml \
-t tfonnx \
--data_type int8 \
--engine_file $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.$IN_HEIGHT.$IN_WIDTH.int8.engine \
--cal_image_dir $USER_EXPERIMENT_DIR/data/calibration_samples/ \
--cal_cache_file $USER_EXPERIMENT_DIR/models/exp_m1_final/calibration.$IN_HEIGHT.$IN_WIDTH.bin \
--cal_data_file $USER_EXPERIMENT_DIR/models/exp_m1_final/coco.$IN_HEIGHT.$IN_WIDTH.tensorfile \
--batch_size 1 \
--batches $NUM_CALIB_SAMPLES \
--max_batch_size 1 \
--data_format channels_last
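For completeness, here is how I clean up before re-running this cell, following the NOTE above about cal_cache_file and cal_data_file. This is only a minimal sketch, assuming LOCAL_EXPERIMENT_DIR, IN_HEIGHT and IN_WIDTH are set earlier in the notebook (adjust the names if yours differ):

import os

# Cleanup sketch: remove stale export/calibration artifacts so the export
# can be re-run from scratch. Filenames mirror the variables in the cell above.
exp_dir = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"), "models/exp_m1_final")
h, w = os.getenv("IN_HEIGHT"), os.getenv("IN_WIDTH")
stale_files = [
    "bpnet_model.etlt",                    # previous export output
    "calibration.{}.{}.bin".format(h, w),  # cal_cache_file: remove to re-run calibration
    "coco.{}.{}.tensorfile".format(h, w),  # cal_data_file: remove to regenerate calibration data
]
for name in stale_files:
    path = os.path.join(exp_dir, name)
    if os.path.exists(path):
        os.remove(path)
        print("removed", path)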
Here are the logs:
2021-10-29 11:17:47,683 [INFO] root: Registry: ['nvcr.io']
2021-10-29 11:17:47,781 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/username/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-10-29 09:17:49.914932: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning:
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.
For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
warnings.warn(errors.NumbaWarning(msg))
/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning:
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.
For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
warnings.warn(errors.NumbaWarning(msg))
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 236, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 232, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 194, in run_export
AssertionError: Default output file /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt already exists
Traceback (most recent call last):
File "/usr/local/bin/bpnet", line 8, in <module>
sys.exit(main())
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/entrypoint/bpnet.py", line 12, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2021-10-29 11:17:58,043 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I would like to understand this error, since it seems to come from the Python scripts launched by the command.
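One detail I am unsure about: the cleanup cell deletes the file on the host via LOCAL_EXPERIMENT_DIR, while the assertion reports the in-container path /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt, so perhaps the two paths do not refer to the same file. In case it helps, here is a small diagnostic sketch to check the host-side file and the mount mapping (assuming the mounts file is /home/username/.tao_mounts.json, as in the warning above):

import json
import os

# Diagnostic sketch: does the host path the notebook deletes correspond to
# the container path the export script checks?
host_file = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"),
                         "models/exp_m1_final/bpnet_model.etlt")
print("host file exists:", os.path.exists(host_file))

# Print the host -> container mappings from the TAO launcher mounts file.
with open(os.path.expanduser("~/.tao_mounts.json")) as f:
    mounts = json.load(f)
for m in mounts.get("Mounts", []):
    print("host {source} -> container {destination}".format(**m))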
Thank you!
BR,
Thomas