• Hardware (RTX 3080)
• Network Type (GazeNet)
• TAO Version
(nvidiaVenv) sam@MSI:~$ tao info --verbose
Configuration of the TAO Toolkit Instance
dockers:
nvidia/tao/tao-toolkit-tf:
v3.21.11-tf1.15.5-py3:
docker_registry: nvcr.io
tasks:
1. augment
2. bpnet
3. classification
4. dssd
5. emotionnet
6. efficientdet
7. fpenet
8. gazenet
9. gesturenet
10. heartratenet
11. lprnet
12. mask_rcnn
13. multitask_classification
14. retinanet
15. ssd
16. unet
17. yolo_v3
18. yolo_v4
19. yolo_v4_tiny
20. converter
v3.21.11-tf1.15.4-py3:
docker_registry: nvcr.io
tasks:
1. detectnet_v2
2. faster_rcnn
nvidia/tao/tao-toolkit-pyt:
v3.21.11-py3:
docker_registry: nvcr.io
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. text_classification
4. question_answering
5. token_classification
6. intent_slot_classification
7. punctuation_and_capitalization
8. spectro_gen
9. vocoder
10. action_recognition
nvidia/tao/tao-toolkit-lm:
v3.21.08-py3:
docker_registry: nvcr.io
tasks:
1. n_gram
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021
• Training spec file (I am using the exact instructions/training specs from the GazeNet Jupyter Notebook)
• How to reproduce the issue ?
To reproduce the error, go through the GazeNet Jupyter Notebook and follow the instructions exactly. It is on this command that I get an error:
tao gazenet dataset_convert -folder-suffix pipeline \
-norm_folder_name Norm_Data \
-sets p01-day03 \
-data_root_path /home/sam/cv_samples_v1.3.0/gazenet/MPIIFaceGaze/sample-dataset
Please note that I have adjusted the data_root_path path, and I believe that is the only adjustment I have made while following the Jupyter notebook.
When running the command, there a quite a few TensorFlow warnings that appear, with the error traceback at the end of the output:
2022-01-18 19:15:09.075541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-h6t_1sup because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/augmentation_helper.py:22: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
2022-01-18 19:15:10,957 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/augmentation_helper.py:22: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/augmentation_helper.py:22: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
2022-01-18 19:15:10,957 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/augmentation_helper.py:22: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/gazenet_dataloader_augmentation_V2.py:38: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
2022-01-18 19:15:10,958 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/dataloader/gazenet_dataloader_augmentation_V2.py:38: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
2022-01-18 19:15:11,039 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/data_converter.py:159: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/data_converter.py:162: The name tf.VarLenFeature is deprecated. Please use tf.io.VarLenFeature instead.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/scripts/dataset_convert.py", line 85, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/scripts/dataset_convert.py", line 78, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/tfrecord_manager.py", line 60, in __init__
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/tfrecord_manager.py", line 102, in _extract_cosmos_paths
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/set_strategy_generator.py", line 50, in __init__
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/gaze_strategy.py", line 57, in __init__
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/dataio/gaze_strategy.py", line 198, in _set_source_paths
OSError: Unable to find a data source
Traceback (most recent call last):
File "/usr/local/bin/gazenet", line 8, in <module>
sys.exit(main())
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/mitgazenet/entrypoint/gazenet.py", line 13, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
I believe this developers forum post depicts a similar issue near the bottom comments for the GazeNet model, so I am unsure if this is the same underlying issue or not.
Please note that I am using wsl2, ubuntu 20.04, and cuda version 11.6.