Container Exits Immediately Without Any Error Message When Training/Evaluating Models

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) A100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) UNET
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
format_version: 1.0
tlt_version: 3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi, I have an old TLT UNET model that was trained with the TLT version mentioned above. When I try to redo the training today, every tlt command (tlt unet train / tlt unet evaluate) gives me the output below. The container exits immediately without any error message.

For multi-GPU, change --gpus based on your machine.
2022-09-26 19:30:56,152 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
2022-09-26 19:30:58,492 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Do you have any ideas on what might be causing this issue?

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.

Please reinstall nvidia-tao.

$ pip3 install nvidia-tao==0.1.24
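
Separately, the warning in the log above asks for a "user":"UID:GID" entry in the DockerOptions portion of ~/.tlt_mounts.json. It may not be related to the container stopping, but if you want to retain your host permissions, a minimal sketch for generating that fragment could look like the following. The variable name docker_user is only illustrative, and the snippet assumes the rest of your mounts file is already valid JSON. It prints to stdout and does not modify any file.

```shell
# Build the "UID:GID" string from the current host user,
# using the "id -u" and "id -g" commands named in the warning.
docker_user="$(id -u):$(id -g)"

# Print a DockerOptions fragment to stdout. Nothing is written to disk;
# merge the "user" entry into ~/.tlt_mounts.json by hand.
printf '"DockerOptions": {\n    "user": "%s"\n}\n' "$docker_user"
```

After merging the entry, rerun the tlt command and check whether the root warning is gone.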
