Tao Docker container crashes after some time

alaapdhall79 · October 1, 2022, 12:51pm

• Hardware: A6000
• Network Type MaskRcnn
• TLT Version dockers:

['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

Hey, I tried converting a trained .etlt model to an engine file but noticed weird behavior during conversion. The Docker container stops after a few seconds without converting anything at all: no logs, no memory usage, nothing.

Here are the logs I get:

                                                                                                                                                                                                                      
2022-10-01 07:09:40,315 [INFO] root: Registry: ['nvcr.io']
2022-10-01 07:09:40,342 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-10-01 07:09:40,375 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 19642 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 19642 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1669, GPU 19960 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 20228 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
2022-10-01 07:09:45,675 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
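As a side note on the warning in the log above: it asks for a "user":"UID:GID" entry in the DockerOptions portion of tao_mounts.json. A minimal sketch of building that entry in Python is below. The helper name with_docker_user and the example mount paths are hypothetical, and this only addresses the permissions warning; there is no indication that it is related to the container stopping.

```python
import copy
import json
import os


def with_docker_user(config, uid, gid):
    """Return a copy of a tao_mounts.json-style dict with DockerOptions.user set.

    Hypothetical helper for illustration; the key names "DockerOptions" and
    "user" come from the warning text in the log.
    """
    updated = copy.deepcopy(config)
    # setdefault keeps any DockerOptions entries that are already present.
    updated.setdefault("DockerOptions", {})["user"] = f"{uid}:{gid}"
    return updated


if __name__ == "__main__":
    # Hypothetical starting config; real mount paths will differ.
    example = {
        "Mounts": [
            {"source": "/path/on/host", "destination": "/workspace/data"}
        ]
    }
    # os.getuid() / os.getgid() give the same numbers as `id -u` / `id -g`
    # (available on Unix-like systems only).
    print(json.dumps(with_docker_user(example, os.getuid(), os.getgid()), indent=4))
```

The function returns a new dict rather than editing the input, so the existing file contents can be compared against the result before anything is written back.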

I tried a couple of times but nothing seems to work.

Then I tried opening a shell in the container and running the command from there to see if that works. To my surprise, when I start a shell in the container using tao mask_rcnn run /bin/bash, it just exits by itself after 5-6 seconds.

Here are the logs:


2022-10-01 07:14:42,935 [INFO] root: Registry: ['nvcr.io']
2022-10-01 07:14:42,962 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-10-01 07:14:43,000 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
root@23b04a10793d:/workspace# 2022-10-01 07:14:48,874 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

As you can see, I didn’t do anything after logging in, and within 5 seconds it stopped with no log output.

I am not sure what is wrong here. Could you please look into it?

Morganh · October 1, 2022, 2:45pm

Please reinstall nvidia-tao.

$ pip3 install nvidia-tao==0.1.24
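Before and after reinstalling, it may help to confirm which launcher version is actually installed in the active Python environment. A minimal sketch using the standard library is below; the function name installed_version is hypothetical, and the package name nvidia-tao is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package="nvidia-tao"):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("nvidia-tao is not installed in this environment")
    else:
        print(f"nvidia-tao launcher version: {version}")
```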

alaapdhall79 · October 1, 2022, 6:22pm

Hey, I cannot update TAO because many models were trained on the older version, so having one model converted with the newer TAO and the others with the old one won’t work.

Also, while running inference through the updated TAO, I get this error:

[TensorRT] ERROR: 1: [stdArchiveReader.cpp::StdArchiveReader::34] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 43, Serialized Engine Version: 0)
[TensorRT] ERROR: 4: [runtime.cpp::deserializeCudaEngine::75] Error Code 4: Internal Error (Engine deserialization failed.)

Can I not just convert it using my current TAO version? I tried reinstalling the current version, but that didn’t work; I get the same error.
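The error above reports that the version tag read from the engine file does not match the one the runtime expects. When comparing several engine files, the two numbers can be pulled out of the log text with a short script. This is only a sketch: the function name parse_version_mismatch is hypothetical, and the regular expression assumes the message wording shown in the log above.

```python
import re

# Pattern based on the wording of the TensorRT message in the log above.
_VERSION_RE = re.compile(
    r"Current Version:\s*(\d+),\s*Serialized Engine\s+Version:\s*(\d+)"
)


def parse_version_mismatch(log_text):
    """Return (current_version, serialized_engine_version) or None if not found."""
    # Collapse all whitespace so terminal line wraps do not break the match.
    flat = " ".join(log_text.split())
    match = _VERSION_RE.search(flat)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = (
        "Version tag does not match. Note: Current Version: 43, Serialized Engine\n"
        " Version: 0)"
    )
    print(parse_version_mismatch(sample))
```

A result where the two numbers differ means the engine file and the runtime that loads it disagree on the serialization version.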

Morganh · October 2, 2022, 2:52am

Or you can refer to the two workarounds mentioned in Chmod: cannot access '/opt/ngccli/ngc': No such file or directory - #2 by Morganh.

system · October 16, 2022, 2:52am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
