Tao Docker container crashes after some time

• Hardware: A6000
• Network Type: MaskRCNN
• TLT Version dockers:

['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

Hey, I tried converting a trained .etlt model to an engine file but noticed odd behavior during the conversion. The Docker container stops after a few seconds without converting anything at all: no error logs, no further memory usage, nothing.

Here are the logs I get:

                                                                                                                                                                                                                      
2022-10-01 07:09:40,315 [INFO] root: Registry: ['nvcr.io']
2022-10-01 07:09:40,342 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-10-01 07:09:40,375 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 19642 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 19642 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1669, GPU 19960 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 20228 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
2022-10-01 07:09:45,675 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I tried a couple of times, but nothing seems to work.

Then I tried opening a shell in the container to run the command manually and see if that works. To my surprise, when I enter the container using tao mask_rcnn run /bin/bash, it exits by itself after 5-6 seconds.

Here are the logs:


2022-10-01 07:14:42,935 [INFO] root: Registry: ['nvcr.io']
2022-10-01 07:14:42,962 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-10-01 07:14:43,000 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
root@23b04a10793d:/workspace# 2022-10-01 07:14:48,874 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

As you can see, I didn't do anything after logging in, and within 5 seconds the container stopped with no log output.

I am not sure what is wrong here. Could you please look into it?
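A side note on the warning that appears in both logs: the launcher suggests adding a "user":"UID:GID" entry to the DockerOptions portion of tao_mounts.json. The Python sketch below shows one way to build that entry. The helper names are hypothetical, the example config is a placeholder, and nothing in this thread establishes that the warning is related to the container stopping.

```python
import json
import os


def docker_user_option(uid=None, gid=None):
    """Build the DockerOptions fragment described in the launcher warning.

    uid/gid default to the current user's IDs (the same values that
    "id -u" and "id -g" print on a Unix host).
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return {"DockerOptions": {"user": f"{uid}:{gid}"}}


def merge_into_mounts(config, uid=None, gid=None):
    """Return a copy of a tao_mounts.json-style dict with the user entry set.

    Existing DockerOptions keys are kept; only "user" is added or replaced.
    """
    merged = dict(config)
    options = dict(merged.get("DockerOptions", {}))
    options.update(docker_user_option(uid, gid)["DockerOptions"])
    merged["DockerOptions"] = options
    return merged


if __name__ == "__main__":
    # Placeholder config; a real file would list the host/container mounts.
    example = {"Mounts": []}
    print(json.dumps(merge_into_mounts(example), indent=4))
```

The printed JSON can be compared against the existing tao_mounts.json before editing the file by hand.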

Please reinstall nvidia-tao.

$ pip3 install nvidia-tao==0.1.24
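To confirm which launcher version is actually installed after running the command above, a small check like the following can help. It is only a sketch: it assumes the launcher is installed in the same Python environment under the distribution name nvidia-tao (as in the pip command), and the helper names are made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, wanted):
    """True only if the package is installed at exactly the wanted version."""
    return installed_version(package) == wanted


if __name__ == "__main__":
    # "nvidia-tao" and "0.1.24" are taken from the pip command above.
    print("installed:", installed_version("nvidia-tao"))
    print("matches 0.1.24:", matches_pin("nvidia-tao", "0.1.24"))
```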

Hey, I cannot update TAO because many models were trained on the older version, so having one model converted with the newer TAO and the others with the old version won't work.

Also, while running inference through the updated TAO, I am getting this error:

[TensorRT] ERROR: 1: [stdArchiveReader.cpp::StdArchiveReader::34] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 43, Serialized Engine Version: 0)
[TensorRT] ERROR: 4: [runtime.cpp::deserializeCudaEngine::75] Error Code 4: Internal Error (Engine deserialization failed.)

Can I not just convert it using the current TAO version? I tried reinstalling the current version, but that didn't work; I get the same error.

Or you can refer to the two workarounds mentioned in the topic "Chmod: cannot access '/opt/ngccli/ngc': No such file or directory" (reply #2 by Morganh).
