TAO Toolkit fails to train LPRNet model

Please provide the following information when requesting support.

• Hardware: Tesla V100
• Network Type: LPRNet
• Driver version: 510.47.03
• CUDA version: 11.6
• TLT Version: nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• Training spec file: using tutorial_spec.txt
• How to reproduce the issue: see the command and detailed log below.

Command

tao lprnet train -e /workspace/tao-experiments/lprnet/tutorial_spec.txt -r /workspace/tao-experiments/lprnet/ -k nvidia_tlt -m /workspace/tao-experiments/lprnet/us_lprnet_baseline18_trainable.tlt

Output

2022-05-05 19:27:20,536 [INFO] root: Registry: ['nvcr.io']
2022-05-05 19:27:20,683 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-sa1i0rir because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
nvidia-smi: [garbled binary output] ... has expired!
Please contact your provider jahidulhamid@yahoo.com
Traceback (most recent call last):
  File "/usr/local/bin/lprnet", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/entrypoint/lprnet.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 315, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 224, in set_gpu_info_single_node
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 192, in check_valid_gpus
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['nvidia-smi', '-L']' returned non-zero exit status 1.
2022-05-05 19:27:26,968 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I tried the same thing on a Tesla T4 with:

  • Driver version: 470.82.0

  • CUDA Version: 11.4

It worked perfectly.
So, is the driver version the reason for the failure?

To narrow it down, under the 510 driver, what is the result of running the experiments below?

tao lprnet -h

and

tao ssd -h
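
If those also fail in the same way, one more way to isolate the problem, assuming Docker with the NVIDIA container runtime on the host and using the image tag from your log, is to run nvidia-smi inside that image directly:

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 nvidia-smi -L

If this also returns a non-zero exit status, the issue is with the GPU setup of the containers on the host rather than with TAO.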

Okay, so I fixed the issue. Nothing is wrong with the TAO Toolkit.

  • What I found was that every container with GPU access on the DGX server gave this error when running the nvidia-smi command. The GPUs were visible to the containers (for example via torch.cuda.get_device_properties in a PyTorch GPU container), but nvidia-smi itself failed.

  • The TAO lprnet train script runs the nvidia-smi -L command before starting the training process, as can be seen from the detailed error above, so it hit the same failure.

  • Restarting the Docker daemon fixed it, and now everything works fine (commands below).
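
For reference, a minimal sketch of the fix, assuming a systemd-managed Docker daemon on the DGX host (the image tag is the one from the log above):

sudo systemctl restart docker
docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 nvidia-smi -L

After the restart, nvidia-smi -L lists the GPUs inside the container again and tao lprnet train gets past its GPU check.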

Thank you

