Hello,
My problem concerns an error in the BodyPoseNet notebook (see these articles: 1 & 2).
- Hardware: GV100GL (Tesla V100, 32 GB)
- Network: BodyposeNet
- TAO Version (docker_tag): v3.21.08-py3
- Training spec file: I used the default spec file in the notebook
- How to reproduce the issue: Same as above; I ran the commands provided in the notebook. The problem starts at step 9.3 (INT8 Optimization), at this command:
import os  # needed if this cell is run standalone

output_file = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"), "models/exp_m1_final/bpnet_model.etlt")
# NOTE: If you are trying to re-run calibration, please remove the calibration table (cal_cache_file).
# If you are trying to re-generate calibration data, please remove cal_data_file as well.
if os.path.exists(output_file):
    os.system("rm {}".format(output_file))
!tao bpnet export \
-m $USER_EXPERIMENT_DIR/models/exp_m1_retrain/$RETRAIN_MODEL_CHECKPOINT \
-o $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.etlt \
-k $KEY \
-d $IN_HEIGHT,$IN_WIDTH,$IN_CHANNELS \
-e $SPECS_DIR/bpnet_retrain_m1_coco.yaml \
-t tfonnx \
--data_type int8 \
--engine_file $USER_EXPERIMENT_DIR/models/exp_m1_final/bpnet_model.$IN_HEIGHT.$IN_WIDTH.int8.engine \
--cal_image_dir $USER_EXPERIMENT_DIR/data/calibration_samples/ \
--cal_cache_file $USER_EXPERIMENT_DIR/models/exp_m1_final/calibration.$IN_HEIGHT.$IN_WIDTH.bin \
--cal_data_file $USER_EXPERIMENT_DIR/models/exp_m1_final/coco.$IN_HEIGHT.$IN_WIDTH.tensorfile \
--batch_size 1 \
--batches $NUM_CALIB_SAMPLES \
--max_batch_size 1 \
--data_format channels_last
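For completeness, here is how I clean up before re-running this cell, following the NOTE above about cal_cache_file and cal_data_file. This is only a minimal sketch, assuming LOCAL_EXPERIMENT_DIR, IN_HEIGHT and IN_WIDTH are set earlier in the notebook (adjust the names if yours differ):

import os

# Cleanup sketch: remove stale export/calibration artifacts so the export
# can be re-run from scratch. Filenames mirror the variables in the cell above.
exp_dir = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"), "models/exp_m1_final")
h, w = os.getenv("IN_HEIGHT"), os.getenv("IN_WIDTH")
stale_files = [
    "bpnet_model.etlt",                    # previous export output
    "calibration.{}.{}.bin".format(h, w),  # cal_cache_file: remove to re-run calibration
    "coco.{}.{}.tensorfile".format(h, w),  # cal_data_file: remove to regenerate calibration data
]
for name in stale_files:
    path = os.path.join(exp_dir, name)
    if os.path.exists(path):
        os.remove(path)
        print("removed", path)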
Here are the logs:
2021-10-29 11:17:47,683 [INFO] root: Registry: ['nvcr.io']
2021-10-29 11:17:47,781 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/username/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-10-29 09:17:49.914932: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning:
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so.
For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
warnings.warn(errors.NumbaWarning(msg))
/usr/local/lib/python3.6/dist-packages/numba/cuda/envvars.py:17: NumbaWarning:
Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice/.
For more information about alternatives visit: ('http://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
warnings.warn(errors.NumbaWarning(msg))
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 236, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 232, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/export.py", line 194, in run_export
AssertionError: Default output file /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt already exists
Traceback (most recent call last):
File "/usr/local/bin/bpnet", line 8, in <module>
sys.exit(main())
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/entrypoint/bpnet.py", line 12, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2021-10-29 11:17:58,043 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I would like to understand this error, since it seems to come from the Python scripts launched by the command.
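One detail I am unsure about: the cleanup cell deletes the file on the host via LOCAL_EXPERIMENT_DIR, while the assertion reports the in-container path /workspace/tao-experiments/bpnet/models/exp_m1_final/bpnet_model.etlt, so perhaps the two paths do not refer to the same file. In case it helps, here is a small diagnostic sketch to check the host-side file and the mount mapping (assuming the mounts file is /home/username/.tao_mounts.json, as in the warning above):

import json
import os

# Diagnostic sketch: does the host path the notebook deletes correspond to
# the container path the export script checks?
host_file = os.path.join(os.getenv("LOCAL_EXPERIMENT_DIR"),
                         "models/exp_m1_final/bpnet_model.etlt")
print("host file exists:", os.path.exists(host_file))

# Print the host -> container mappings from the TAO launcher mounts file.
with open(os.path.expanduser("~/.tao_mounts.json")) as f:
    mounts = json.load(f)
for m in mounts.get("Mounts", []):
    print("host {source} -> container {destination}".format(**m))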
Thank you!
BR,
Thomas