Train with my own tlt model #2

Hi there, it’s me again.

I have trained a tlt model and I want to use it as a pretrained model to train a new model. And this time, I tried the latest version tlt3.0. The error that related to the number of object classes occur again. It seems that using tlt3.0 does not solve my problem.

This is my commands:

!ssd train --gpus 1 --gpu_index=$GPU_INDEX \
           -e $SPECS_DIR/ssd_train_resnet18_kitti_52class.txt \
           -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
           -k $KEY \
           -m $USER_EXPERIMENT_DIR/experiment_dir_pruned_38class_2/ssd_resnet18_pruned.tlt

And, here is my error log:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-12-22 07:05:54,156 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-12-22 07:05:54,354 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

Epoch 1/2000
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 313, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 309, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 237, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Incompatible shapes: [8,65382,39] vs. [8,65382,53]
	 [[{{node loss/ssd_predictions_loss/mul}}]]
	 [[loss/add_46/_5123]]
  (1) Invalid argument: Incompatible shapes: [8,65382,39] vs. [8,65382,53]
	 [[{{node loss/ssd_predictions_loss/mul}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

May I know which docker did you use?
How about
$ tlt info --verbose

or
$ tao info --verbose

According to the log, I am afraid you are using 3.0-dp version or older.

Please see TAO Toolkit for CV | NVIDIA NGC and use the latest.

After checking, I am using this one:

nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Is there any effect of using this version? Or, is there any obvious difference from v3.21, causing these program errors?

3.21.08 or 3.21.11 should work. There are some updates.
3.0-dp was released in Feb,2021

When I tried to use Tao v3.21, I encountered a little problem.
This is the command used pull and run docker:

 sudo docker run --gpus all -it --name tlt3 -v "/home/ubuntu/tlt3_demo/":"/workspace/tlt3_demo" -p 8888:8888 --restart=always nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
Unable to find image 'nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

And this is what I run in jupyter notebook:

!tao ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

Error log:

2021-12-22 08:58:36,448 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 297, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 152, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 62, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

I have found that the cause of the error may be triggering tlt docker based on one docker. But I don’t know what command I gave to cause it. Thanks for helping!

Refer to Tlt augment not working - #2 by Morganh

What about the folder that i want to mout, which is “/home/ubuntu/tlt3_demo/”:"/workspace/tlt3_demo"

All the mapping are mentioned in your ~/.tao_mounts.json.

See more info in TAO Toolkit Launcher — TAO Toolkit 3.21.11 documentation

Thanks, your method works.
But another problem followed. Besides, I couldn’t find the location of my .json.

The error is as follows:

~/.tao_mounts.json wasn't found. Falling back to obtain mount points and docker configs from ~/.tlt_mounts.json.
Please note that this will be deprecated going forward.
2021-12-23 06:28:36,520 [INFO] root: Registry: ['nvcr.io']
2021-12-23 06:28:36,635 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2021-12-23 06:28:36,930 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2021-12-23 06:28:36,930 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf
2021-12-23 06:36:44,088 [INFO] tlt.components.docker_handler.docker_handler: Container pull complete.
2021-12-23 06:36:44,089 [INFO] root: No mount points were found in the /root/.tlt_mounts.json file.
2021-12-23 06:36:44,089 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tlt_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/export.py", line 10, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/export/ssd_exporter.py", line 30, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-23 06:36:50,515 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Did you ever create ~/.tlt_mounts.json or ~/.tao_mounts.json ?

More, how did you meet above error? Can you share the detailed steps?

  • How did you trigger the docker? Use tao launcher or not?
  • The full command when you run into error.

Thanks for answering.

  1. I trigger the docker through typing the command above.
  2. Than I use create a jupyter notebook and download the example.
  3. I follow the step in the example " Object Detection using TAO SSD"
  4. In the first block “Set up env variables and map drives” I run the following cod:
# Mapping up the local directories to the TAO docker.
import json
mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
            # Mapping the data directory
            {
                "source": os.environ["LOCAL_PROJECT_DIR"],
                "destination": "/workspace/tao-experiments"
            },
            # Mapping the specs directory.
            {
                "source": os.environ["LOCAL_SPECS_DIR"],
                "destination": os.environ["SPECS_DIR"]
            },
        ],
    "DockerOptions": {
        "user": "{}:{}".format(os.getuid(), os.getgid())
    }
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)
  1. Than I install something I might need:
# SKIP this step IF you have already installed the TAO launcher.
!pip3 install nvidia-pyindex
!pip3 install nvidia-tao
  1. Because, I already have my own dataset, so I skip to create tfrecords:
# Creating a new directory for the output tfrecords dump.
print("Converting the training set to TFRecords.")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

Than I meet the error.

Could you add “--runtime=nvidia” when you trigger docker?
Refer to an old version of TLT 2.0 user guide. Requirements and Installation — Transfer Learning Toolkit 2.0 documentation

Thank you for your reply.
After adding the “–untime=nvidia”, a different error occurred when I ran the tfrecords program.
Log:

Converting the training set to TFRecords.
2021-12-29 05:32:26,430 [INFO] root: Registry: ['nvcr.io']
2021-12-29 05:32:26,563 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/export.py", line 10, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/export/ssd_exporter.py", line 30, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-29 05:32:30,447 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

By the way, this is my full command to trigger the docker now:

docker run --gpus all -it --name tao3 -v "/home/ubuntu/tlt3_demo/":"/workspace/tlt3_demo" -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock --runtime=nvidia  --restart=always nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

May I know which gpu did you run on?

Sure, I’m using GeForce RTX 2080 Ti.

Please check if you are running on Ubuntu.

There is also a similar topic which has the same error.
No CUDA-capable device is detected on tao detectnet_v2 dataset convert - #4 by NilsAI . For that topic, adding the tag --privileged helped run without the issues.

Thanks. I try to use the method you gave me in the article. I also remove Tao from the instruction.

!ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

It works!
But why? What went wrong in the process that caused me to take it away?

I see. You were triggering the docker via "docker run xxx ".
In this way, there is not needed to use “tao”.
See TAO Toolkit Launcher — TAO Toolkit 3.21.11 documentation

For example, to train a detectnet_v2 model in the interactive session, run the following command after invoking an interactive session using tao detectnet_v2

detectnet_v2 train -e /path/to/experiment_spec.txt
-k
-r /path/to/train/output
–gpus