LPRNet Error

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): 5.3.0

I’m getting the following error while using tao train:

WARNING: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py:82: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 366, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 362, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 345, in main
    run_experiment(config_path=args.experiment_spec_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 86, in run_experiment
    os.makedirs(results_dir)
  File "/usr/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 3 more times]
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/mainak'
Execution status: FAIL
2024-06-03 15:32:19,782 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Here’s the detailed TAO Toolkit info:

task_group:         
    model:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.0.0-tf2.11.0:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.3.0-pyt:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                        14. classification_pyt
                        15. segformer
    dataset:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.3.0-data-services:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.3.0-deploy:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.3.0
published_date: 03/14/2024

Here’s the tao_mounts.json:

{
    "Mounts": [
        {
            "source": "/home/mainak/ms/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/mainak/ms/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet/specs",
            "destination": "/workspace/tao-experiments/lprnet/specs"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}

I run using:

tao model lprnet train --gpus=1 -e /home/mainak/ms/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet/specs/tutorial_spec.txt -k nvidia_tlt -r /home/mainak/ms/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet/experiment_dir_unpruned -m /home/mainak/ms/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

Any help is highly appreciated
@Morganh

The path should be a path inside the docker. That means, the path defined in “destination” of the tao_mounts.json file.

I’m sorry for being naive, but in the .ipynb file it’s given as:

The following notebook requires the user to set an env variable called $LOCAL_PROJECT_DIR as the path to the user’s workspace. Please note that the dataset to run this notebook is expected to reside in $LOCAL_PROJECT_DIR/data, while the TAO experiment generated collaterals will be output to $LOCAL_PROJECT_DIR/lprnet.

!tao model lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $SPECS_DIR/tutorial_spec.txt \
                  -k $KEY \
                  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -m $USER_EXPERIMENT_DIR/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

and

%env USER_EXPERIMENT_DIR=/workspace/tao-experiments/lprnet

Is this USER_EXPERIMENT_DIR a path on my local system or inside the docker? Can you please elaborate?

In tao_tutorials/notebooks/tao_launcher_starter_kit/lprnet/lprnet.ipynb at main · NVIDIA/tao_tutorials · GitHub, the USER_EXPERIMENT_DIR is a path inside the docker.
You can also check the tao_mounts.json file. It mounts the local “source” to the docker’s “destination”.

            {
                "source": os.environ["LOCAL_PROJECT_DIR"],
                "destination": "/workspace/tao-experiments"
            },

The “destination” is a path inside the docker.
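Concretely, with the mount above, everything under the host’s $LOCAL_PROJECT_DIR appears under /workspace/tao-experiments inside the container, so the paths passed to the launcher should be the container-side ones. A hedged sketch based on the notebook variables (values assumed from the mounts shown earlier; adjust to your setup):

```shell
# Container-side paths (the "destination" side of tao_mounts.json), not host paths.
export USER_EXPERIMENT_DIR=/workspace/tao-experiments/lprnet
export SPECS_DIR=/workspace/tao-experiments/lprnet/specs

tao model lprnet train --gpus=1 \
  -e $SPECS_DIR/tutorial_spec.txt \
  -k nvidia_tlt \
  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
  -m $USER_EXPERIMENT_DIR/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt
```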

@Morganh
Hi,
I closed the topic as that particular problem was solved. However, when I migrated to an EC2 instance, the same approach gives me the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 366, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 362, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 345, in main
    run_experiment(config_path=args.experiment_spec_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/lprnet/scripts/train.py", line 89, in run_experiment
    status_logging.StatusLogger(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/logging/logging.py", line 203, in __init__
    self.l_file = open(self.log_path, "a" if append else "w")
PermissionError: [Errno 13] Permission denied: '/workspace/tao-experiments/lprnet/experiment_dir_unpruned/status.json'
Execution status: FAIL
2024-06-05 05:17:01,558 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Any help is highly appreciated.

I run using:

tao model lprnet train --gpus=1 --gpu_index=0 -e /workspace/tao-experiments/lprnet/specs/tutorial_spec.txt -k nvidia_tlt -r /workspace/tao-experiments/lprnet/experiment_dir_unpruned -m /workspace/tao-experiments/lprnet/pretrained_lprnet_baseline18/lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

Please check if it works after removing the above.
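Alternatively, if you want to keep the "user": "1000:1000" entry in DockerOptions, making sure that UID/GID can write to the mounted host directory should also clear the PermissionError. A hedged sketch, assuming the host "source" path from the tao_mounts.json shown earlier:

```shell
# Host path mounted as /workspace/tao-experiments (assumed; adjust to your setup).
EXP_DIR="$HOME/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet"

# Pre-create the results directory and hand the tree to UID/GID 1000,
# the user the container runs as per DockerOptions.
mkdir -p "$EXP_DIR/experiment_dir_unpruned"
sudo chown -R 1000:1000 "$EXP_DIR"
```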

Actually, after installing the nvidia-container-toolkit via the steps below:

# first remove the old ones
sudo apt remove --purge nvidia-container-toolkit
sudo apt update
sudo apt autoremove

# check version availability
apt list -a "*nvidia-container-toolkit*"
# install 1.14.0-1
sudo apt install nvidia-container-toolkit=1.14.0-1 nvidia-container-toolkit-base=1.14.0-1

when I check using:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

it gives me the following error:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

Please install nvidia-docker.

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo pkill -SIGHUP dockerd
$ sudo systemctl restart docker.service
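On recent toolkit versions, the “unknown or invalid runtime name: nvidia” error usually means the runtime was installed but never registered in /etc/docker/daemon.json. The nvidia-ctk utility (shipped with nvidia-container-toolkit-base) can register it; a sketch of that documented step, in case nvidia-docker2 alone does not do it on your instance:

```shell
# Write the "nvidia" runtime entry into /etc/docker/daemon.json...
sudo nvidia-ctk runtime configure --runtime=docker
# ...and restart Docker so the daemon picks it up.
sudo systemctl restart docker

# The runtime list should now include "nvidia".
docker info --format '{{json .Runtimes}}'
```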

I did follow the steps. However, the error still persists.

ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
OK
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
> sudo tee /etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo apt-get update
Get:1 file:/var/nv-tensorrt-repo-ubuntu2004-cuda11.4-trt8.2.5.1-ga-20220505  InRelease
Ign:1 file:/var/nv-tensorrt-repo-ubuntu2004-cuda11.4-trt8.2.5.1-ga-20220505  InRelease
Get:2 file:/var/nv-tensorrt-repo-ubuntu2004-cuda11.4-trt8.2.5.1-ga-20220505  Release [569 B]
Hit:3 http://ap-south-1.ec2.archive.ubuntu.com/ubuntu focal InRelease
Get:2 file:/var/nv-tensorrt-repo-ubuntu2004-cuda11.4-trt8.2.5.1-ga-20220505  Release [569 B]                                                                         
Get:4 http://ap-south-1.ec2.archive.ubuntu.com/ubuntu focal-updates InRelease [128 kB]                                                                               
Hit:5 http://ap-south-1.ec2.archive.ubuntu.com/ubuntu focal-backports InRelease                                                                                      
Hit:6 https://nvidia.github.io/libnvidia-container/stable/deb/amd64  InRelease                                                                                       
Hit:7 https://nvidia.github.io/libnvidia-container/experimental/deb/amd64  InRelease                                                                                 
Get:8 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease [1484 B]                                                                      
Hit:9 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease                                                                          
Hit:10 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease                                                                                           
Hit:11 https://download.docker.com/linux/ubuntu focal InRelease                                                                  
Hit:12 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease                                     
Get:14 http://ap-south-1.ec2.archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1190 kB]
Get:15 http://ap-south-1.ec2.archive.ubuntu.com/ubuntu focal-updates/universe Translation-en [286 kB]
Hit:16 http://security.ubuntu.com/ubuntu focal-security InRelease                 
Fetched 1606 kB in 1s (1975 kB/s)               
Reading package lists... Done
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo apt-get install -y nvidia-docker2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-docker2 is already the newest version (2.14.0-1).
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo pkill -SIGHUP dockerd
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo systemctl restart docker.service
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
ubuntu@ip-172-31-7-134:~/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/lprnet$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

I shared my previous steps in Run TAO training probelm - #30 by Morganh. You can refer to it to narrow down.

This works!!! However, when I run:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The error is still there. How am I able to train, then?

I shared my previous steps in Run TAO training probelm - #30 by Morganh. You can refer to it to narrow down.

OK, I will surely check. Thanks!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.