Please provide the following information when requesting support.
• Hardware Tesla V100
• Network Type: LPRnet
• Driver version: 510.47.03
• CUDA version: 11.6
• TLT Version: nvidia/tao/tao-toolkit-tf: v3.21.11-tf1.15.5-py3(Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file: using tutorial_spec.txt
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Command
tao lprnet train -e /workspace/tao-experiments/lprnet/tutorial_spec.txt -r /workspace/tao-experiments/lprnet/ -k nvidia_tlt -m /workspace/tao-experiments/lprnet/us_lprnet_baseline18_trainable.tlt
• Output
2022-05-05 19:27:20,536 [INFO] root: Registry: [‘nvcr.io’]
2022-05-05 19:27:20,683 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-sa1i0rir because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
nvidia-smi: ������H�CF� �⾿
{7���LV �C��2��A���0�p���Qf��֍n��ޡ�A�yú?�%q�oNT�=��g�&Z�?
���φh��6c�]��G���_has expired!
Please contact your provider jahidulhamid@yahoo.com
Traceback (most recent call last):
File “/usr/local/bin/lprnet”, line 8, in
sys.exit(main())
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py”, line 12, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 315, in launch_job
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 224, in set_gpu_info_single_node
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 192, in check_valid_gpus
File “/usr/lib/python3.6/subprocess.py”, line 356, in check_output
**kwargs).stdout
File “/usr/lib/python3.6/subprocess.py”, line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command ‘[‘nvidia-smi’, ‘-L’]’ returned non-zero exit status 1.
2022-05-05 19:27:26,968 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I tried the same thing on a tesla T4 with
-
driver version Driver Version: 470.82.0
-
CUDA Version: 11.4
It worked perfectly.
So is the drver version the reason for failure?