Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): RTX 4070 Laptop GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Fan_hybrid_small
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I have some questions regarding TensorBoard for visualizing training of "classification_pyt" (notebooks/tao_launcher_starter_kit):
Does the model (classification_pyt / Fan_hybrid_small) support visualization in TAO Toolkit?
In addition to TensorBoard, do I also need to install TensorFlow?
I followed the guide for using wandb, but it raises the following error:
env: EPOCHS=20
Train Classification Model
2024-12-20 18:45:04,558 [TAO Toolkit] [INFO] root 160: Registry: [‘nvcr.io’]
2024-12-20 18:45:04,619 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-12-20 18:45:04,707 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/eduardo/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-20 18:45:04,707 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Docker instantiation failed with error: 500 Server Error: Internal Server Error (“driver failed programming external connectivity on endpoint nice_mirzakhani (6ee7493c8391314df6cb8b5f4030040d047feaebb67d3be575cecdb363f2f543): Error starting userland proxy: listen tcp4 0.0.0.0:8888: bind: address already in use”)
There is an existing container using that port. Please check with $ docker ps. You can delete the conflicting container ($ docker rm -fv container_id) and log in again ($ docker run xxx).
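For example (container_id is a placeholder; use the ID shown by docker ps for the container holding port 8888):
$ docker ps
$ docker rm -fv container_id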
Could you please open a terminal and check whether the following works?
$ docker run --runtime=nvidia -it --rm -v /home/yourpath:/home/dockerpath nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash
Then you will be inside the docker container. An example of running training: # classification_pyt train xxx
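As a rough sketch (the spec path is an assumption based on the -v mount above and the cats_dogs spec shipped with the classification_pyt notebook; adjust it to wherever your spec file ends up inside the container):
# classification_pyt train -e /home/dockerpath/specs/train_cats_dogs.yaml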
===========================
=== TAO Toolkit PyTorch ===
NVIDIA Release 5.5.0-PyT (build 88113656)
TAO Toolkit Version 5.5.0
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 535.216.01 which has support for CUDA 12.2. This container
was built with CUDA 12.4 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See "Why CUDA Compatibility" in the CUDA Compatibility r555 documentation for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …
root@cba3f7c905fb:/opt/nvidia/tools# ls
Jenkinsfile.develop Jenkinsfile.main README.md README.txt build.sh converter tao-converter
**Note:** I cannot see the example to run training (classification_pyt) in the container. Did I do something wrong in the command to run the docker container? Do I need to replace …/yourpath:/…/dockerpath …?
Yes. Please use this approach to map your local files into the docker container. For example, if your previous working directory is /home/xxx/tao, you can use: -v /home/xxx/tao:/home/xxx/tao
Then you will find all of your local files inside the docker container under the same path.
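Combining this with the run command above and the shared-memory flags recommended in the container banner, the full command would look roughly like this (a sketch only; adjust /home/xxx/tao to your own working directory):
$ docker run --runtime=nvidia -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/xxx/tao:/home/xxx/tao nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash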
This is the result of trying to run the jupyter notebook (classification_pyt) inside the docker container:
root@d2e0f9078bdc:/home/eduardo/Devel/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt# ls
cats_dogs_dataset.zip classification.ipynb specs
root@d2e0f9078bdc:/home/eduardo/Devel/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt# jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root
[I 09:23:05.101 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 09:23:05.573 NotebookApp] jupyter_tensorboard extension loaded.
[W 09:23:05.573 NotebookApp] Error loading server extension jupyterlab
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/notebook/notebookapp.py", line 2027, in init_server_extensions
    mod = importlib.import_module(modulename)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'jupyterlab'
[I 09:23:05.574 NotebookApp] [Jupytext Server Extension] NotebookApp.contents_manager_class is (a subclass of) jupytext.TextFileContentsManager already - OK
[I 09:23:05.577 NotebookApp] Serving notebooks from local directory: /home/eduardo/Devel/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt
[I 09:23:05.577 NotebookApp] Jupyter Notebook 6.4.10 is running at:
[I 09:23:05.577 NotebookApp] http://hostname:8888/?token=fcbe25a8f1995a8c24c868512128c6e1bf5ef46791368c6a
[I 09:23:05.577 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:23:05.578 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-338-open.html
Or copy and paste this URL:
http://hostname:8888/?token=fcbe25a8f1995a8c24c868512128c6e1bf5ef46791368c6a
I could not open the notebook:
This site can’t be reached
Check if there is a typo in hostname.
root@bf0f20fa5378:/home/eduardo/Devel/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt# classification_pyt train -e /home/eduardo/Devel/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_pyt/specs/train_cats_dogs.yaml
Error merging 'train_cats_dogs.yaml' with schema
Key 'wandb' not in 'TrainExpConfig'
    full_key: train.wandb
    reference_type=TrainExpConfig
    object_type=TrainExpConfig
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E1223 10:26:34.345000 138580502832256 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 695) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
No, I mean that the classification_pyt network in the tao_pytorch docker does not currently support visualizing images during training. This is a feature we are planning to add. However, there is another network (odise) that supports wandb. If possible, you can implement wandb in classification_pyt by leveraging the code from odise's wandb integration.