Error training segformer after running out of disk space: exec failed

In the TAO SegFormer notebook, when running

!tao model segformer train \
                  -e $SPECS_DIR/train_mit_b5.yaml \
                  -r $RESULTS_DIR/isbi_experiment \
                  -g $NUM_GPUS

It first ran out of disk space on the root drive. After cleaning up, including erasing all docker images with docker container rm, I rebooted and reran the training, and now get this error:

exec failed: unable to start container process: exec: "segformer": executable file not found in $PATH: unknown

Complete results:

Train SegFormer Model
2024-02-10 22:36:57,856 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-02-10 22:36:58,043 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0
2024-02-10 22:36:58,211 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/david/.tao_mounts.json" file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
2024-02-10 22:36:58,211 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
OCI runtime exec failed: exec failed: unable to start container process: exec: "segformer": executable file not found in $PATH: unknown
2024-02-10 22:36:59,469 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Trying to diagnose, I ran

docker run -it --rm --network=host nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0 /bin/bash

And got the error

chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory

The complete docker run log
docker run.log (244.6 KB)

I found it odd that at the beginning of the docker run it says

 ngccli_reg_linux.zi 100%[===================>]  44.93M  34.8MB/s    in 1.3s    

2024-02-10 21:51:38 (34.8 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [47113663/47113663]

But in fact, directory /opt/ngccli doesn’t exist:

:/opt$ ls
containerd google microsoft nvidia ros

Thanks for the help

Please use nvcr.io/nvidia/tao/tao-toolkit:5.2.0.1-pyt1.14.0 instead.
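
For anyone hitting the same "executable file not found in $PATH" error, one way to confirm whether an image actually provides the segformer entrypoint is to open a shell in it (as with the docker run command above, swapping in this tag) and check the PATH. This is only a minimal sketch: check_cmd is a hypothetical helper name, not part of TAO, and it assumes the container shell is POSIX-compatible.

```shell
#!/bin/sh
# Hypothetical helper, not part of TAO: report whether a command is on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Intended use, from a shell inside the container:
#   check_cmd segformer
```

If it prints "missing: segformer", the launcher's exec will fail in the same way as in the log above, whatever the spec file contains.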

Thanks @Morganh

Proceeded to train, evaluate, and run inference successfully.

Working on C++ trt …
