Can't run TAO 3.0 with RTX A6000

Please provide the following information when requesting support.

• Hardware A6000
• Network Type Classification
• TLT Version nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
When I try to run the classification sample on an A6000 GPU, I get the error log below.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/classification", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/entrypoint/makenet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/export.py", line 8, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/export/classification_exporter.py", line 14, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW
2021-12-13 17:38:04,036 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Which OS are you running? Is it Ubuntu?

Also, can you run the commands below and share the full log, including the command line?

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

and then

# python
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
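
The same check can also be run as a small script, which makes it easier to paste the complete result into a reply. This is only a sketch: the helper name `check_cuda_init` is hypothetical, and it assumes `pycuda` is importable inside the container (the traceback above suggests it is).

```python
def check_cuda_init():
    """Try to initialise the CUDA driver through pycuda and describe the outcome."""
    try:
        import pycuda.driver as cuda
    except ImportError as exc:
        # pycuda (or the driver library it links against) is not available.
        return f"pycuda not importable: {exc}"
    try:
        cuda.init()
    except Exception as exc:
        # pycuda raises its own error types here, e.g. LogicError.
        return f"cuda.init() raised {type(exc).__name__}: {exc}"
    return f"cuda.init() OK, {cuda.Device.count()} device(s) visible"


if __name__ == "__main__":
    print(check_cuda_init())
```

Returning a string instead of raising keeps the output to a single line whether or not the driver initialises.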

I ran it on Ubuntu 20 and got the same result with the command line below.

$ docker run --gpus all -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash
--2021-12-14 09:30:09--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 54.230.168.6, 54.230.168.68, 54.230.168.100, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|54.230.168.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25122731 (24M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’

ngccli_reg_linux.zip 100%[=====================================================================================================================>] 23.96M 14.3MB/s in 1.7s

2021-12-14 09:30:11 (14.3 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [25122731/25122731]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
  inflating: /opt/ngccli/ngc
 extracting: /opt/ngccli/ngc.md5
root@54d5739243c5:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW

Please run with
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

The result is the same as with $ docker run --gpus all -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash
--2021-12-15 01:35:24--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 99.86.202.123, 99.86.202.18, 99.86.202.63, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|99.86.202.123|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25122952 (24M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’

ngccli_reg_linux.zip 100%[=====================================================================================================================>] 23.96M 14.6MB/s in 1.6s

2021-12-15 01:35:26 (14.6 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [25122952/25122952]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
  inflating: /opt/ngccli/ngc
 extracting: /opt/ngccli/ngc.md5
root@ba3be56731c5:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.LogicError: cuInit failed: forward compatibility was attempted on non supported HW

Can you share the result of
$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  RTX A6000           Off  | 00000000:1A:00.0 Off |                  Off |
| 30%   26C    P8    19W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  RTX A6000           Off  | 00000000:1B:00.0 Off |                  Off |
| 30%   26C    P8    22W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  RTX A6000           Off  | 00000000:60:00.0 Off |                  Off |
| 30%   27C    P8    18W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  RTX A6000           Off  | 00000000:61:00.0 Off |                  Off |
| 30%   27C    P8    18W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  RTX A6000           Off  | 00000000:B1:00.0 Off |                  Off |
| 30%   27C    P8    16W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  RTX A6000           Off  | 00000000:B2:00.0 Off |                  Off |
| 30%   26C    P8    15W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  RTX A6000           Off  | 00000000:DA:00.0 Off |                  Off |
| 30%   24C    P8    19W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  RTX A6000           Off  | 00000000:DB:00.0 Off |                  Off |
| 30%   25C    P8    22W / 300W |     10MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Please check whether the topic "pycuda._driver.LogicError: cuInit failed: system not yet initialized" (post #19 by gsweeney) helps.

Could you please explain how the topic "pycuda._driver.LogicError: cuInit failed: system not yet initialized" (post #19 by gsweeney) relates to the current issue?

I mentioned it just in case you have an HGX A100 8-GPU product. If not, that topic is different from yours.
You have set up only one GPU, right?

Please also double-check the software environment:
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#software-requirements

I suggest trying an Ubuntu 18.04 machine as well.

You can also try a different version of TAO/TLT, for example:
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

All requirements are met, and I run TAO successfully on machines with RTX 30x0 GPUs.

I also found the hardware specs section in the quick start guide.

Does that mean TAO does not support the A6000?

TAO does support the A6000. You can find similar topics from users running the RTX A6000, for example "Detectnetv2 wont train if pretrained_model_file is specified. Peoplenet transfer learning".

Please try to update the driver for A6000.


I can confirm that A6000s work for training with NVIDIA TAO.

Driver version: 470.86 (latest).
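
To compare the driver on a host against the one reported as working here, the version can be read from the nvidia-smi header. The sketch below is only illustrative: `parse_driver_version` is a hypothetical helper, and 470.86 is the version reported in this thread, not an official minimum from the TAO documentation.

```python
import re


def parse_driver_version(smi_text):
    """Extract the 'Driver Version' field from nvidia-smi output as a tuple of ints."""
    match = re.search(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_text)
    if match is None:
        raise ValueError("no 'Driver Version' field found")
    return tuple(int(part) for part in match.group(1).split("."))


# Header line copied from the nvidia-smi output earlier in this thread.
failing = parse_driver_version(
    "| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.3 |"
)
# Version reported as working on A6000 in this thread (assumption: not an official minimum).
working = (470, 86)

print(failing)            # (460, 91, 3)
print(failing < working)  # True
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically.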


Thanks @pullmyleg
