Int8 Optimization on BodyPose Net fails

Hello,

I am hitting an error in the BodyPoseNet notebook (see these articles 1 & 2).

  • Hardware: GV100GL (Tesla V100 32Gb)
  • Network: BodyposeNet
  • TAO Version (docker_tag): v3.21.08-py3
  • Training spec file: I used the default spec file in the notebook
  • How to reproduce the issue: Same as above, I ran the commands from the notebook provided. The problem starts at step 9.3 (INT8 Optimization) with this command:
output_file = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"), "models/exp_m1_final/bpnet_model.etlt")
# NOTE: If you are trying to re-run calibration, please remove the calibration table (cal_cache_file).
# If you are trying to re-generate calibration data, please remove cal_data_file as well.

if os.path.exists(output_file):
    os.system("rm {}".format(output_file))

!tao bpnet export \
    -m $USER_EXPERIMENT_DIR/models/exp_m1_retrain/$RETRAIN_MODEL_CHECKPOINT \
    -o $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.etlt \
    -k $KEY \
    -d $IN_HEIGHT,$IN_WIDTH,$IN_CHANNELS \
    -e $SPECS_DIR/bpnet_retrain_m1_coco.yaml \
    -t tfonnx \
    --data_type int8 \
    --engine_file $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.$IN_HEIGHT.$IN_WIDTH.int8.engine \
    --cal_image_dir $USER_EXPERIMENT_DIR/data/calibration_samples/ \
    --cal_cache_file $USER_EXPERIMENT_DIR/models/exp_m1_final/calibration.$IN_HEIGHT.$IN_WIDTH.bin  \
    --cal_data_file $USER_EXPERIMENT_DIR/models/exp_m1_final/coco.$IN_HEIGHT.$IN_WIDTH.tensorfile \
    --batch_size 1 \
    --batches $NUM_CALIB_SAMPLES \
    --max_batch_size 1 \
    --data_format channels_last

Here are the logs:

2021-10-29 11:17:47,683 [INFO] root: Registry: ['nvcr.io']
2021-10-29 11:17:47,781 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/username/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-10-29 09:17:49.914932: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning: 
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
  warnings.warn(errors.NumbaWarning(msg))
/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning: 
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.

For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
  warnings.warn(errors.NumbaWarning(msg))
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 236, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 232, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 194, in run_export
AssertionError: Default output file /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt already exists
Traceback (most recent call last):
  File "/usr/local/bin/bpnet", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/entrypoint/bpnet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2021-10-29 11:17:58,043 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I would like to understand the error, as it seems to come from the Python scripts launched by the command.

Thank you!
BR,
Thomas
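
The Docker warning in the log above suggests adding a "user":"UID:GID" entry to the DockerOptions portion of ~/.tao_mounts.json. Files written by a container running as root may not be removable by the host-side rm in the notebook cell, which is one possible reason the .etlt file survives. A minimal Python sketch of that edit, assuming the file is plain JSON and the key names are exactly as printed in the warning; the helper names with_host_user and update_mounts_file are hypothetical:

```python
import json
import os


def with_host_user(mounts_config, uid=None, gid=None):
    """Return a copy of a .tao_mounts.json dict with DockerOptions.user set.

    The "DockerOptions" and "user" key names are taken from the launcher's
    warning message; everything else in the config is left untouched.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    updated = dict(mounts_config)
    docker_options = dict(updated.get("DockerOptions", {}))
    docker_options["user"] = "{}:{}".format(uid, gid)
    updated["DockerOptions"] = docker_options
    return updated


def update_mounts_file(path=os.path.expanduser("~/.tao_mounts.json")):
    """Rewrite the mounts file in place (assumes it already exists)."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(with_host_user(config), f, indent=4)
```

Calling update_mounts_file() once on the host should be enough; files already created as root would still need to be removed or re-owned separately.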

You can remove the etlt file and retry.
! tao bpnet run rm /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt
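
The assertion in export.py fires because the default output file already exists when the export starts. The notebook cell checks the path under LOCAL_EXPERIMENT_DIR on the host, and os.system("rm ...") does not raise if the removal fails, so a leftover file can go unnoticed. A sketch of a stricter cleanup in Python; the helper name remove_stale_export is hypothetical:

```python
import os


def remove_stale_export(output_file):
    """Delete a previous export so `tao bpnet export` can write a fresh one.

    Returns True if a file was removed, False if nothing was there.
    Unlike os.system("rm ..."), os.remove raises (for example a
    PermissionError when the file is owned by root) instead of
    continuing silently.
    """
    if not os.path.exists(output_file):
        return False
    os.remove(output_file)
    return True
```

In the notebook this could be called with os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"], "models/exp_m1_final/bpnet_model.etlt"). Using os.environ[...] rather than os.getenv(...) raises a KeyError when the variable is unset instead of passing None into os.path.join.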

Hi @Morganh,

Thanks for your answer.

Is there an easier way than doing tao bpnet run (command) to enter the container created by tao? I don’t see any containers, even stopped, with docker container ls -a.

Thank you

If a container is running, you can find it with
$ docker ps
then
$ docker exec -it <CONTAINER ID> /bin/bash


Yes, but that’s the problem: I don’t have any running containers, or even stopped ones. I run the commands from the bpnet notebook. Everything is fine before part 9 (Model Export and INT8 Quantization), but after that, !docker container ls, docker container ls -a and !docker ps -a only show the hello-world container, which is weird, right?

Can you run in the terminal instead of jupyter notebook?

In a terminal, start a container as below.
$ tao bpnet run /bin/bash
then exit it.
root@e9771ae2bef7:/workspace# exit

Then
$ docker ps
$ docker exec -it <CONTAINER ID> /bin/bash
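
To pick out the container ID without reading the table by eye, docker ps can print a machine-readable form. A sketch in Python, assuming the docker CLI is on PATH and the current user is allowed to talk to the daemon; the function names are hypothetical:

```python
import subprocess

# Go-template format understood by `docker ps --format`.
PS_FORMAT = "{{.ID}}\t{{.Image}}\t{{.Status}}"


def parse_docker_ps(output):
    """Turn tab-separated `docker ps` output into a list of dicts."""
    containers = []
    for line in output.splitlines():
        if not line.strip():
            continue
        container_id, image, status = line.split("\t", 2)
        containers.append({"id": container_id, "image": image, "status": status})
    return containers


def list_containers(include_stopped=True):
    """Run `docker ps` (with -a if requested) and parse the result."""
    cmd = ["docker", "ps", "--format", PS_FORMAT]
    if include_stopped:
        cmd.append("-a")
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True,
                            universal_newlines=True)
    return parse_docker_ps(result.stdout)
```

list_containers() returns one dict per container, so the ID to pass to docker exec can be selected by image name.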

I can’t. I don’t know if it is because of the company proxy or something else, but I get a DockerException (Permission denied):

Traceback (most recent call last):
  File "/home/user/Envs/tao/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/user/Envs/tao/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/home/user/Envs/tao/lib/python3.6/site-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
PermissionError: [Errno 13] Permission denied

Also, all the environment variables live in the notebook's session, since I ran everything from there.
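
The traceback ends in sock.connect(self.unix_socket), so the Python Docker client was refused access to the Docker daemon's Unix socket. A Unix socket connection does not go through a network proxy, which suggests a local permission problem. A sketch of a quick check in Python, assuming the default socket path /var/run/docker.sock (the path is not shown in the traceback); the helper name can_use_docker_socket is hypothetical:

```python
import os
import stat

DEFAULT_SOCKET = "/var/run/docker.sock"  # assumed default, not from the log


def can_use_docker_socket(path=DEFAULT_SOCKET):
    """Report whether the current user can read and write the socket path.

    Returns a dict with 'exists', 'is_socket' and 'accessible' flags.
    """
    report = {"exists": os.path.exists(path), "is_socket": False, "accessible": False}
    if report["exists"]:
        report["is_socket"] = stat.S_ISSOCK(os.stat(path).st_mode)
        report["accessible"] = os.access(path, os.R_OK | os.W_OK)
    return report
```

If 'accessible' comes back False, the usual suspects are the socket's group ownership and whether the current user belongs to that group.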

Can you check whether the topic "Error running the facenet notebook" helps? Its log is similar to yours.

Alright, the docker.socket issue is fixed and the command from the notebook ran fine. Some later commands fail, but they are easy to fix (output directory already contains files, etc.).

Anyway, thank you!

