Docker instantiation failed when running tao ssd

Hello,
When I run tao ssd dataset_convert … on an HP GPU machine, I get the error below. This is after I already reinstalled the driver with: sudo apt install nvidia-driver-460

(launcher) root@HP-Z-DSWS:~# tao ssd dataset_convert \
     -d $SPECS_DIR/ssd_tfrecords_kitti_train.txt \
     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_train

########
2021-12-02 17:08:05,450 [INFO] root: Registry: ['nvcr.io']
2021-12-02 17:08:05,643 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2021-12-02 17:08:05,755 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/export.py", line 10, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/export/ssd_exporter.py", line 30, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-02 17:08:11,241 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
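(As an aside, the "Docker will run the commands as root" warning in the log above can be addressed by adding a user entry to the DockerOptions section of /root/.tao_mounts.json, as the warning suggests. A minimal sketch, assuming a UID/GID of 1000:1000 as reported by id -u and id -g; the usual Mounts entries are elided here for brevity:)

```json
{
    "Mounts": [],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```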

• Hardware: GPU name: Quadro RTX 8000
Driver Version: 460.106.00, CUDA Version: 11.2
• Network Type: ssd_resnet18
• TLT Version:
dockers:
    nvidia/tao/tao-toolkit-tf:
        v3.21.11-tf1.15.5-py3:
        v3.21.11-tf1.15.4-py3:
    nvidia/tao/tao-toolkit-pyt:
        v3.21.11-py3:
    nvidia/tao/tao-toolkit-lm:
        v3.21.08-py3:
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

Thanks a lot!

Can you run the command below and share the result?
$ nvidia-smi

How about:
$ dpkg -l | grep cuda

Yes, I can; it is as shown above. Thanks!

No, it should not be the same result. Could you please run:
$ dpkg -l | grep cuda

ii cuda-command-line-tools-10-1 10.1.243-1 amd64 CUDA command-line tools
ii cuda-compiler-10-1 10.1.243-1 amd64 CUDA compiler
ii cuda-cudart-10-1 10.1.243-1 amd64 CUDA Runtime native Libraries
ii cuda-cudart-dev-10-1 10.1.243-1 amd64 CUDA Runtime native dev links, headers
ii cuda-cufft-10-1 10.1.243-1 amd64 CUFFT native runtime libraries
ii cuda-cufft-dev-10-1 10.1.243-1 amd64 CUFFT native dev links, headers
ii cuda-cuobjdump-10-1 10.1.243-1 amd64 CUDA cuobjdump
ii cuda-cupti-10-1 10.1.243-1 amd64 CUDA profiling tools interface.
ii cuda-curand-10-1 10.1.243-1 amd64 CURAND native runtime libraries
ii cuda-curand-dev-10-1 10.1.243-1 amd64 CURAND native dev links, headers
ii cuda-cusolver-10-1 10.1.243-1 amd64 CUDA solver native runtime libraries
ii cuda-cusolver-dev-10-1 10.1.243-1 amd64 CUDA solver native dev links, headers
ii cuda-cusparse-10-1 10.1.243-1 amd64 CUSPARSE native runtime libraries
ii cuda-cusparse-dev-10-1 10.1.243-1 amd64 CUSPARSE native dev links, headers
ii cuda-documentation-10-1 10.1.243-1 amd64 CUDA documentation
ii cuda-driver-dev-10-1 10.1.243-1 amd64 CUDA Driver native dev stub library
ii cuda-gdb-10-1 10.1.243-1 amd64 CUDA-GDB
ii cuda-gpu-library-advisor-10-1 10.1.243-1 amd64 CUDA GPU Library Advisor.
ii cuda-libraries-dev-10-1 10.1.243-1 amd64 CUDA Libraries 10.1 development meta-package
ii cuda-license-10-1 10.1.243-1 amd64 CUDA licenses
ii cuda-license-10-2 10.2.89-1 amd64 CUDA licenses
ii cuda-memcheck-10-1 10.1.243-1 amd64 CUDA-MEMCHECK
ii cuda-misc-headers-10-1 10.1.243-1 amd64 CUDA miscellaneous headers
ii cuda-npp-10-1 10.1.243-1 amd64 NPP native runtime libraries
ii cuda-npp-dev-10-1 10.1.243-1 amd64 NPP native dev links, headers
ii cuda-nsight-10-1 10.1.243-1 amd64 CUDA nsight
ii cuda-nsight-compute-10-1 10.1.243-1 amd64 NVIDIA Nsight Compute
ii cuda-nsight-systems-10-1 10.1.243-1 amd64 NVIDIA Nsight Systems
ii cuda-nvcc-10-1 10.1.243-1 amd64 CUDA nvcc
ii cuda-nvdisasm-10-1 10.1.243-1 amd64 CUDA disassembler
ii cuda-nvgraph-10-1 10.1.243-1 amd64 NVGRAPH native runtime libraries
ii cuda-nvgraph-dev-10-1 10.1.243-1 amd64 NVGRAPH native dev links, headers
ii cuda-nvjpeg-10-1 10.1.243-1 amd64 NVJPEG native runtime libraries
ii cuda-nvjpeg-dev-10-1 10.1.243-1 amd64 NVJPEG native dev links, headers
ii cuda-nvml-dev-10-1 10.1.243-1 amd64 NVML native dev links, headers
ii cuda-nvprof-10-1 10.1.243-1 amd64 CUDA Profiler tools
ii cuda-nvprune-10-1 10.1.243-1 amd64 CUDA nvprune
ii cuda-nvrtc-10-1 10.1.243-1 amd64 NVRTC native runtime libraries
ii cuda-nvrtc-dev-10-1 10.1.243-1 amd64 NVRTC native dev links, headers
ii cuda-nvtx-10-1 10.1.243-1 amd64 NVIDIA Tools Extension
ii cuda-nvvp-10-1 10.1.243-1 amd64 CUDA nvvp
ii cuda-repo-ubuntu1804 10.2.89-1 amd64 cuda repository configuration files
ii cuda-samples-10-1 10.1.243-1 amd64 CUDA example applications
ii cuda-sanitizer-api-10-1 10.1.243-1 amd64 CUDA Sanitizer API
ii cuda-toolkit-10-1 10.1.243-1 amd64 CUDA Toolkit 10.1 meta-package
ii cuda-tools-10-1 10.1.243-1 amd64 CUDA Tools meta-package
ii cuda-visual-tools-10-1 10.1.243-1 amd64 CUDA visual tools

I got the output above.
Actually, I used the tlt-streamanalytics container in the beginning (nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3). It can train detectnet_v2_resnet18 with "tlt-train detectnet_v2 …", but it did not work with "tlt-train ssd", so I moved to TAO.

This does not make sense. For TLT 2.0, "tlt-train ssd" should be working as well.

For your current issue, I am afraid you need to update CUDA.
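The "cuInit failed: forward compatibility was attempted on non supported HW" error typically means the CUDA user-mode libraries inside the container are newer than the host driver, and the forward-compatibility fallback they attempt is only supported on data-center GPUs, not on a Quadro RTX 8000. A minimal sketch of the version check involved; the minimum-driver table below is an assumption for illustration, so verify the values against NVIDIA's CUDA Toolkit release notes:

```python
# Sketch: decide whether a host driver natively supports a given CUDA runtime.
# The minimum-driver table is an assumption drawn from NVIDIA's release notes;
# consult the current CUDA Toolkit release notes for authoritative values.
MIN_LINUX_DRIVER = {
    "11.2": (460, 27),   # CUDA 11.2 assumed to need >= 460.27.x
    "11.5": (495, 29),   # CUDA 11.5 assumed to need >= 495.29.x
}

def driver_supports(driver_version: str, cuda_version: str) -> bool:
    """Return True if the driver meets the runtime's minimum version natively."""
    major, minor = (int(part) for part in driver_version.split(".")[:2])
    return (major, minor) >= MIN_LINUX_DRIVER[cuda_version]

# Driver 460.106.00 handles CUDA 11.2, but a container built on a newer CUDA
# would trigger the forward-compatibility path, which Quadro GPUs do not support.
print(driver_supports("460.106.00", "11.2"))  # True
print(driver_supports("460.106.00", "11.5"))  # False
```

In practice, upgrading the host driver to one that meets the container's CUDA requirement avoids the forward-compatibility path entirely.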

Thanks for the reply!
Actually, I used the tlt-streamanalytics container in the beginning (nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3). It can train detectnet_v2_resnet18 with "tlt-train detectnet_v2 …", but it did not work with "tlt-train ssd", so I moved to TAO.
My first question relates to the DeepStream Python apps: why does a TLT detectnet_v2_resnet18 model with 90% accuracy detect nothing when applied in the DeepStream Python apps? That is another topic I am looking into.
My second question is why "tlt-train ssd" in the container nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 gives this error, while "tlt-train detectnet_v2" works.
My third question is about tao ssd: why does cuInit fail?
Sorry for so many questions; these are the steps I tried and the errors I hit one by one. Should I raise three separate topics, or can I ask them all here? Thanks so much!!

For my first question, about the DeepStream Python apps not detecting anything with a detectnet_v2_resnet18 model of 90% accuracy, I have only found this in the forum.

Update CUDA, OK, I will try, thanks. Do you have any idea about the DeepStream Python apps not detecting bounding boxes (my other question above)?

Please create your issues in new topics separately.


Thanks, I will do that. Just a quick question: should tlt-train in the container nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 give the same results as tao, as far as the models themselves are concerned?

Actually, TLT 2.0 was released in Aug 2020.
See

The TAO versions start from 3.21.08, in 2021.
So we cannot compare these two versions directly.


ok I see, thanks:)

Can I ask something still related to TLT/TAO here?
"tlt-train detectnet_v2", prune, and retrain went well, but when I go for "tlt-infer", errors come out like this:

Using TensorFlow backend.
2021-12-03 22:58:24.552363: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
  File "/usr/local/bin/tlt-infer", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_infer.py", line 54, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/inference.py", line 187, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/spec_handler/spec_loader.py", line 88, in load_inference_spec
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/spec_handler/spec_loader.py", line 50, in load_proto
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/spec_handler/spec_loader.py", line 36, in _load_from_file
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 734, in Merge
    allow_unknown_field=allow_unknown_field)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 802, in MergeLines
    return parser.MergeLines(lines, message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 827, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 849, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 974, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 1048, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py", line 941, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 33:9 : Message type "ClusteringConfig" has no field named "clustering_algorithm".

When you use an old version of TLT, I suggest you download the corresponding Jupyter notebook or refer to its user guide.
Different versions of TLT/TAO may have different parameters.

The above is an example; you need to modify the inference spec file accordingly.
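For instance, a spec written for TAO 3.x can contain a clustering_config block like the sketch below; TLT 2.0's ClusteringConfig has no clustering_algorithm field, so that line must be deleted. The surrounding field names and values here are illustrative assumptions based on typical DetectNet_v2 specs; check the user guide for your exact version:

```
clustering_config {
  clustering_algorithm: DBSCAN   # TAO 3.x only; delete this line for TLT 2.0
  coverage_threshold: 0.005
  dbscan_eps: 0.3
  dbscan_min_samples: 0.05
  minimum_bounding_box_height: 4
}
```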

TLT/TAO user guide: NVIDIA TAO Documentation


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.