Docker instantiation fails when running "tao detectnet_v2" on Xavier NX

I'm attempting to run "tao detectnet_v2 --help" on an Xavier NX and get the following output:

(launcher) nvidia@ubuntu:~/my_apps/testing_tensorRT__files_from_AWS_example_notebook$ tao detectnet_v2 --help
2022-09-07 14:49:48,190 [INFO] root: Registry: ['nvcr.io']
2022-09-07 14:49:48,470 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime instead.: unknown")

I have run "sudo apt-get install -y nvidia-docker2" and also "sudo apt-get install nvidia-container-runtime" - both are installed.
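
As a sanity check (these are standard Docker / NVIDIA Container Toolkit commands, nothing TAO-specific), the runtime registration can be confirmed with:

docker info | grep -i runtime
nvidia-container-cli --version

The first command should list nvidia among the available runtimes.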

Running "tao -h" gives the following output:

(launcher) nvidia@ubuntu:~/my_apps/testing_tensorRT__files_from_AWS_example_notebook$ tao -h
usage: tao [-h]
{list,stop,info,action_recognition,augment,bpnet,classification,converter,detectnet_v2,dssd,efficientdet,emotionnet,faster_rcnn,fpenet,gazenet,gesturenet,heartratenet,intent_slot_classification,lprnet,mask_rcnn,multitask_classification,n_gram,pointpillars,pose_classification,punctuation_and_capitalization,question_answering,retinanet,spectro_gen,speech_to_text,speech_to_text_citrinet,speech_to_text_conformer,ssd,text_classification,token_classification,unet,vocoder,yolo_v3,yolo_v4,yolo_v4_tiny}

Launcher for TAO Toolkit.

optional arguments:
-h, --help show this help message and exit

tasks:
{list,stop,info,action_recognition,augment,bpnet,classification,converter,detectnet_v2,dssd,efficientdet,emotionnet,faster_rcnn,fpenet,gazenet,gesturenet,heartratenet,intent_slot_classification,lprnet,mask_rcnn,multitask_classification,n_gram,pointpillars,pose_classification,punctuation_and_capitalization,question_answering,retinanet,spectro_gen,speech_to_text,speech_to_text_citrinet,speech_to_text_conformer,ssd,text_classification,token_classification,unet,vocoder,yolo_v3,yolo_v4,yolo_v4_tiny}

I have defined a ~/.tao_mounts.json file as follows:

{
    "Mounts": [
        {
            "source": "/home/nvidia/my_apps/testing_tensorRT__files_from_AWS_example_notebook",
            "destination": "/workspace/tlt-experiments"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}

The Docker images on the Xavier are as follows:

REPOSITORY                                          TAG                              IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/l4t-tensorflow                       r35.1.0-tf1.15-py3__my_updates   b16634b62b0a   6 days ago      13GB
nvcr.io/nvidia/l4t-tensorrt                         r8.4.1.5-devel                   9d233de1abe7   12 days ago     10.2GB
nvcr.io/nvidia/l4t-cuda                             11.4.14-runtime                  17b2eaaef496   6 weeks ago     2.48GB
nvcr.io/nvidia/tao/tao-toolkit-tf                   v3.22.05-tf1.15.5-py3            b85103564252   3 months ago    11.7GB
nvcr.io/nvidia/tao/tao-toolkit-tf                   v3.22.05-tf1.15.4-py3            ca92a571a959   3 months ago    16.1GB
nvcr.io/nvidia/deepstream-l4t                       6.1-samples                      6fc8884e47d9   4 months ago    6.07GB
nvcr.io/nvidia/deepstream-l4t                       6.1-base                         0f92b3eb66ba   4 months ago    5.4GB
nvcr.io/nvidia/dli/dli-nano-deepstream              v2.0.0-DS6.0.1                   eb0e1e157f1d   5 months ago    2.22GB
nvcr.io/nvidia/tao/tao-cv-inference-pipeline-l4t    r32.5.0-v0.3-ga-client           bac152d44466   12 months ago   877MB

My final goal is to convert the combination of .etlt & .bin files from an NVIDIA TAO Toolkit example Jupyter notebook into a TensorRT engine file on the Xavier. Running the command for that purpose ("tao converter resnet18_detector.etlt -k $KEY…") throws the same error as "tao detectnet_v2" above, which is what led me to try the simpler command first to isolate the problem.

Any ideas on how to have the tao command run the associated Docker containers using the NVIDIA Container Runtime?

I just tried adding nvidia as the default runtime in the /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
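
For the default-runtime change to be picked up, the Docker daemon generally needs to be restarted, e.g. (assuming systemd, as on a stock JetPack install):

sudo systemctl restart docker
docker info | grep -i 'default runtime'

The second command is just a check that the default runtime now reports nvidia.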

Now when I run "tao detectnet_v2 --help" I get different error messages:

(launcher) nvidia@ubuntu:~/my_apps/testing_tensorRT__files_from_AWS_example_notebook$ tao detectnet_v2 --help
2022-09-07 15:11:30,669 [INFO] root: Registry: ['nvcr.io']
2022-09-07 15:11:31,093 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Error response from daemon: Container c25d552a82157f71c6ecb7204518d8897d47069b8e0166764491c7cace038a44 is not running
2022-09-07 15:11:33,394 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Traceback (most recent call last):
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/api/client.py", line 259, in _raise_for_status
    response.raise_for_status()
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/containers/c25d552a82157f71c6ecb7204518d8897d47069b8e0166764491c7cace038a44/stop

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nvidia/miniconda3/envs/launcher/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/tlt/entrypoint/entrypoint.py", line 115, in main
    args[1:]
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 319, in launch_command
    docker_handler.run_container(command)
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/tlt/components/docker_handler/docker_handler.py", line 316, in run_container
    self.stop_container()
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/tlt/components/docker_handler/docker_handler.py", line 323, in stop_container
    self._container.stop()
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/models/containers.py", line 436, in stop
    return self.client.api.stop(self.id, **kwargs)
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/api/container.py", line 1167, in stop
    self._raise_for_status(res)
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/api/client.py", line 261, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/home/nvidia/miniconda3/envs/launcher/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("No such container: c25d552a82157f71c6ecb7204518d8897d47069b8e0166764491c7cace038a44")

Hi,

Please note that the TAO Toolkit needs to be run in a desktop (x86) environment.

https://developer.nvidia.com/tao-toolkit

Can I train my model with the TAO Toolkit on a Jetson solution?

You can only train with TAO toolkit on an x86 system. You can, however, deploy the optimized models on a Jetson solution.

Thanks.

OK, so I downloaded the Jetson version of the TAO Converter from here:
https://developer.nvidia.com/tao-converter

I performed the steps listed in the README file, made "tao-converter" executable, ran "export KEY=tlt_encode", and then tried to run the converter:

./tao-converter ~/my_apps/testing_tensorRT__files_from_AWS_example_notebook/resnet18_detector.etlt -k $KEY -c calibration.bin -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,384,1248 -i nchw -m 64 -t int8 -e resnet18_detector.trt -b 4

This produces a new error:
./tao-converter: error while loading shared libraries: libnvinfer.so.7: cannot open shared object file: No such file or directory

I'm running JetPack 5.0.2, but I don't see the above shared object file anywhere on the Xavier. Are there additional steps I should be performing before running the tao-converter utility, or is there a specific NVIDIA Docker container I should run it from within?
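
For reference, what the installed JetPack actually provides can be checked with (standard library paths on an aarch64 JetPack install, nothing TAO-specific):

ls /usr/lib/aarch64-linux-gnu/libnvinfer.so*
dpkg -l | grep -i nvinfer

On JetPack 5.0.2 these report TensorRT 8.4.x, i.e. libnvinfer.so.8 rather than libnvinfer.so.7, which matches the load failure above from a binary built against TensorRT 7.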

Hi,

It looks like the tool you downloaded is for Clara rather than JetPack.

For JetPack 5.0.2 GA, please find the corresponding TAO converter below:

https://docs.nvidia.com/tao/tao-toolkit/text/tensorrt.html#installing-the-tao-converter
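
For reference, the typical setup for the aarch64 tao-converter is roughly the following (paths assume a default JetPack layout; please double-check against the instructions on the page above and the README bundled with the download):

sudo apt-get install libssl-dev
export TRT_LIB_PATH=/usr/lib/aarch64-linux-gnu
export TRT_INC_PATH=/usr/include/aarch64-linux-gnu
chmod +x tao-converter
./tao-converter -h

The last command is just a smoke test that the binary links against the TensorRT version shipped with JetPack 5.0.2.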

Thanks.
