Train with my own tlt model #2

tmid0819 · December 22, 2021, 7:33am

Hi there, it’s me again.

I have trained a TLT model and want to use it as the pretrained model for training a new one. This time I tried the latest version, TLT 3.0. The error related to the number of object classes occurred again, so it seems that using TLT 3.0 does not solve my problem.

This is my command:

!ssd train --gpus 1 --gpu_index=$GPU_INDEX \
           -e $SPECS_DIR/ssd_train_resnet18_kitti_52class.txt \
           -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
           -k $KEY \
           -m $USER_EXPERIMENT_DIR/experiment_dir_pruned_38class_2/ssd_resnet18_pruned.tlt

And here is my error log:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-12-22 07:05:54,156 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-12-22 07:05:54,354 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

Epoch 1/2000
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 313, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 309, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 237, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Incompatible shapes: [8,65382,39] vs. [8,65382,53]
	 [[{{node loss/ssd_predictions_loss/mul}}]]
	 [[loss/add_46/_5123]]
  (1) Invalid argument: Incompatible shapes: [8,65382,39] vs. [8,65382,53]
	 [[{{node loss/ssd_predictions_loss/mul}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
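
A note on reading the shapes in this error: the two tensors differ only in the last dimension, 39 versus 53. Assuming the class axis reserves one extra channel for background (an assumption, not confirmed anywhere in this thread), those numbers would line up with the 38-class pretrained model and the 52-class spec named in the train command. A minimal Python sketch of that arithmetic follows; `class_channels` is a hypothetical helper, not a TLT/TAO API.

```python
# Hypothetical helper, not part of TLT/TAO: number of channels on the class
# axis if the detector reserves one extra slot for background (assumption).
def class_channels(num_object_classes: int) -> int:
    return num_object_classes + 1


# Shapes copied from the error message above.
shape_a = (8, 65382, 39)
shape_b = (8, 65382, 53)

# Class counts read from the file names in the train command above.
pretrained_classes = 38   # experiment_dir_pruned_38class_2
new_spec_classes = 52     # ssd_train_resnet18_kitti_52class.txt

matches_pretrained = class_channels(pretrained_classes) == shape_a[-1]
matches_new_spec = class_channels(new_spec_classes) == shape_b[-1]
print(matches_pretrained, matches_new_spec)  # prints: True True
```

If that reading is right, the loss is multiplying a tensor sized for the old class count with one sized for the new class count, which is why the shapes cannot be combined.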

Morganh · December 22, 2021, 7:42am

May I know which docker you used?
What is the output of
$ tlt info --verbose

or
$ tao info --verbose

Morganh · December 22, 2021, 7:44am

According to the log, I am afraid you are using the 3.0-dp version or older.

Morganh · December 22, 2021, 7:45am

Please see TAO Toolkit for Computer Vision | NVIDIA NGC and use the latest version.

tmid0819 · December 22, 2021, 7:50am

After checking, I am using this one:

nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Does using this version make a difference? Is there some change between it and v3.21 that would explain these errors?

Morganh · December 22, 2021, 7:51am

3.21.08 or 3.21.11 should work; they include some updates.
3.0-dp was released in February 2021.

tmid0819 · December 22, 2021, 9:18am

When I tried to use TAO v3.21, I ran into a small problem.
This is the command I used to pull and run the docker:

 sudo docker run --gpus all -it --name tlt3 -v "/home/ubuntu/tlt3_demo/":"/workspace/tlt3_demo" -p 8888:8888 --restart=always nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3
Unable to find image 'nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

And this is what I ran in the Jupyter notebook:

!tao ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

Error log:

2021-12-22 08:58:36,448 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 297, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 152, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 62, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

I have found that the cause may be launching the TLT docker from inside another docker container, but I don’t know which of my commands caused it. Thanks for helping!
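
The last traceback above shows the launcher calling docker.from_env(), which then fails with FileNotFoundError while connecting to a Unix socket. One quick way to check whether a Docker socket is visible from where the notebook runs is sketched below. The path /var/run/docker.sock is Docker's usual default and is an assumption about this setup; `docker_socket_visible` is a hypothetical helper, not part of the TAO launcher.

```python
import os
import stat

# Docker's usual default socket path; assumed here, adjust if DOCKER_HOST differs.
DEFAULT_DOCKER_SOCKET = "/var/run/docker.sock"


def docker_socket_visible(path: str = DEFAULT_DOCKER_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, permission problem, or similar: treat as not visible.
        return False
    return stat.S_ISSOCK(mode)


if docker_socket_visible():
    print("Docker socket found; a Docker client may be able to reach the daemon.")
else:
    print("No Docker socket here; a launcher call would likely fail as above.")
```

Inside a container started with a plain `docker run`, the socket is normally absent unless it was mounted explicitly, which would be consistent with the error shown.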

Morganh · December 22, 2021, 10:53am

Refer to Tlt augment not working - #2 by Morganh

tmid0819 · December 23, 2021, 1:52am

What about the folder that I want to mount, which is "/home/ubuntu/tlt3_demo/":"/workspace/tlt3_demo"?

Morganh · December 23, 2021, 2:04am

All the mappings are defined in your ~/.tao_mounts.json.

See more info in TAO Toolkit Launcher — TAO Toolkit 3.22.05 documentation
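
As a sketch of what such a mapping could look like for the folder asked about above, the snippet below builds the same "Mounts" structure that the TAO sample notebook writes (that notebook cell is quoted later in this thread). Writing to a temporary file instead of ~/.tao_mounts.json is a choice made here to keep the example free of side effects; `write_mounts` is a hypothetical helper.

```python
import json
import os
import tempfile

# Folder pair taken from the question above.
drive_map = {
    "Mounts": [
        {
            "source": "/home/ubuntu/tlt3_demo/",
            "destination": "/workspace/tlt3_demo",
        }
    ]
}


def write_mounts(path: str, mapping: dict) -> None:
    """Write the mounts mapping as indented JSON to `path`."""
    with open(path, "w") as mfile:
        json.dump(mapping, mfile, indent=4)


# Written to a temp dir here; the launcher itself looks for ~/.tao_mounts.json.
demo_path = os.path.join(tempfile.mkdtemp(), "tao_mounts.json")
write_mounts(demo_path, drive_map)

# Read the file back to confirm it round-trips.
with open(demo_path) as mfile:
    loaded = json.load(mfile)
print(loaded["Mounts"][0]["destination"])  # prints: /workspace/tlt3_demo
```

To use it for real, the same call would target os.path.expanduser("~/.tao_mounts.json") for the user who runs the launcher.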

tmid0819 · December 27, 2021, 3:02am

Thanks, your method works.
But another problem followed. Also, I couldn’t find where my .json file is located.

The error is as follows:

~/.tao_mounts.json wasn't found. Falling back to obtain mount points and docker configs from ~/.tlt_mounts.json.
Please note that this will be deprecated going forward.
2021-12-23 06:28:36,520 [INFO] root: Registry: ['nvcr.io']
2021-12-23 06:28:36,635 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2021-12-23 06:28:36,930 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2021-12-23 06:28:36,930 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf
2021-12-23 06:36:44,088 [INFO] tlt.components.docker_handler.docker_handler: Container pull complete.
2021-12-23 06:36:44,089 [INFO] root: No mount points were found in the /root/.tlt_mounts.json file.
2021-12-23 06:36:44,089 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tlt_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/export.py", line 10, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/export/ssd_exporter.py", line 30, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-23 06:36:50,515 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Morganh · December 27, 2021, 3:24pm

Did you ever create ~/.tlt_mounts.json or ~/.tao_mounts.json?

Also, how did you run into the above error? Can you share the detailed steps?

How did you trigger the docker? Did you use the tao launcher or not?
Please share the full command you ran when you hit the error.

tmid0819 · December 28, 2021, 3:25am

Thanks for answering.

I triggered the docker by typing the command above.
Then I created a Jupyter notebook and downloaded the example.
I followed the steps in the example "Object Detection using TAO SSD".
In the first block, “Set up env variables and map drives”, I ran the following code:

# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")

# Define the dictionary with the mapped drives
drive_map = {
    "Mounts": [
            # Mapping the data directory
            {
                "source": os.environ["LOCAL_PROJECT_DIR"],
                "destination": "/workspace/tao-experiments"
            },
            # Mapping the specs directory.
            {
                "source": os.environ["LOCAL_SPECS_DIR"],
                "destination": os.environ["SPECS_DIR"]
            },
        ],
    "DockerOptions": {
        "user": "{}:{}".format(os.getuid(), os.getgid())
    }
}

# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)

Then I installed the packages I might need:

# SKIP this step IF you have already installed the TAO launcher.
!pip3 install nvidia-pyindex
!pip3 install nvidia-tao

Because I already have my own dataset, I skipped ahead to creating the tfrecords:

# Creating a new directory for the output tfrecords dump.
print("Converting the training set to TFRecords.")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

Then I hit the error.

Morganh · December 28, 2021, 12:00pm

Could you add “--runtime=nvidia” when you trigger the docker?
Refer to the older TLT 2.0 user guide: https://docs.nvidia.com/tao/archive/tlt-20/tlt-user-guide/text/requirements_and_installation.html#running-the-transfer-learning-toolkit

tmid0819 · December 29, 2021, 5:37am

Thank you for your reply.
After adding “--runtime=nvidia”, a different error occurred when I ran the tfrecords conversion.
Log:

Converting the training set to TFRecords.
2021-12-29 05:32:26,430 [INFO] root: Registry: ['nvcr.io']
2021-12-29 05:32:26,563 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/export.py", line 10, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/export/ssd_exporter.py", line 30, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-29 05:32:30,447 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

By the way, this is my full command to trigger the docker now:

docker run --gpus all -it --name tao3 -v "/home/ubuntu/tlt3_demo/":"/workspace/tlt3_demo" -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock --runtime=nvidia  --restart=always nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

Morganh · December 29, 2021, 6:21am

May I know which GPU you are running on?

tmid0819 · December 29, 2021, 6:29am

Sure, I’m using GeForce RTX 2080 Ti.

Morganh · December 29, 2021, 6:30am

Please check if you are running on Ubuntu.

There is also a similar topic with the same error:
No CUDA-capable device is detected on tao detectnet_v2 dataset convert - #4 by NilsAI. In that topic, adding the --privileged flag resolved the issue.

tmid0819 · December 29, 2021, 8:22am

Thanks. I tried the method from the topic you linked. I also removed "tao" from the command:

!ssd dataset_convert \
         -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
         -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

It works!
But why? What was wrong in my setup that made it necessary to remove "tao"?

Morganh · December 29, 2021, 8:27am

I see. You were triggering the docker via "docker run xxx".
When you start the container this way, you do not need to use “tao”.
See TAO Toolkit Launcher - NVIDIA Docs

For example, to train a detectnet_v2 model in the interactive session, run the following command after invoking an interactive session using tao detectnet_v2:

detectnet_v2 train -e /path/to/experiment_spec.txt
-k
-r /path/to/train/output
--gpus
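
The distinction described in this reply can be sketched in Python: inside a container started directly with `docker run`, the task name is called on its own, while from the host the same call goes through the `tao` launcher. `build_command` is a hypothetical helper for illustration only, and the arguments are the ones used earlier in this thread.

```python
from typing import List


def build_command(task: str, subtask: str, args: List[str],
                  inside_container: bool) -> List[str]:
    """Prefix the command with the `tao` launcher only when run from the host."""
    base = [task, subtask] + list(args)
    return base if inside_container else ["tao"] + base


convert_args = [
    "-d", "$SPECS_DIR/ssd_tfrecords_kitti_train.txt",
    "-o", "$DATA_DOWNLOAD_DIR/tfrecords/kitti_train",
]

# Inside a container started with `docker run`: no launcher prefix.
print(" ".join(build_command("ssd", "dataset_convert", convert_args, True)))
# On the host with the launcher installed: prefixed with `tao`.
print(" ".join(build_command("ssd", "dataset_convert", convert_args, False)))
```

The first printed line matches the command that worked in the previous post; the second matches the form that failed when run inside the container.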
