Please provide the following information when requesting support.
• Hardware (RTX A4000)
• Network Type (Yolo_v4)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Hello,
i’m new to TAO and I was trying to follow the yolo_v4 sample notebook. Everything seemed to be running smoothly until the training part. Thats when I got the following output:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2024-07-13 14:58:46,525 [TAO Toolkit] [INFO] root 160: Registry: [‘nvcr.io’]
2024-07-13 14:58:46,587 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-07-13 14:58:46,604 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/jakub/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
2024-07-13 14:58:46,604 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Using TensorFlow backend.
2024-07-13 12:58:49.679827: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-07-13 12:58:49,713 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-07-13 12:58:50,496 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-07-13 12:58:50,517 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-07-13 12:58:50,520 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-07-13 12:58:51,462 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
WARNING: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py:55: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py:55: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py:58: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py:58: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
INFO: Log file already exists at /workspace/tao-experiments/yolo_v4/experiment_dir_unpruned/status.json
INFO: Starting Yolo_V4 Training job
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:195: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:195: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:2018: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.
INFO: Serial augmentation enabled = False
INFO: Pseudo sharding enabled = False
INFO: Max Image Dimensions (all sources): (0, 0)
INFO: number of cpus: 16, io threads: 32, compute threads: 16, buffered batches: -1
INFO: total dataset size 6733, number of sources: 1, batch size per gpu: 20, steps: 337
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
INFO: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
INFO: shuffle: True - shard 0 of 1
INFO: sampling 1 datasets with weights:
INFO: source: 0 weight: 1.000000
WARNING:tensorflow:The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING: The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING: The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING: The operation tf.image.convert_image_dtype
will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/dataio/tf_data_pipe.py:131: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
WARNING: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/dataio/tf_data_pipe.py:131: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.
INFO: Model not found: /home/jakub/tao_eksperymenty/yolo_v4/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5
Traceback (most recent call last):
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py”, line 165, in
main()
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py”, line 717, in return_func
raise e
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py”, line 705, in return_func
return func(*args, **kwargs)
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py”, line 161, in main
raise e
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py”, line 143, in main
run_experiment(
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/train.py”, line 84, in run_experiment
model = build_training_pipeline(
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/utils.py”, line 74, in build_training_pipeline
yolov4.build_training_model(hvd)
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py”, line 480, in build_training_model
self.load_pretrained_model(
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/models/yolov4_model.py”, line 308, in load_pretrained_model
pretrained_model = model_io.load_model(
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/utils/model_io.py”, line 66, in load_model
model = load_keras_model(model_path,
File “/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py”, line 576, in load_keras_model
raise FileNotFoundError(f"Model not found: {filepath}")
FileNotFoundError: Model not found: /home/jakub/tao_eksperymenty/yolo_v4/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5
Execution status: FAIL
2024-07-13 14:59:07,362 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
I can see the model file locally in that directory. This is the content of my ~/.tao_mounts.json file: {
“Mounts”: [
{
“source”: “/home/jakub/tao_eksperymenty”,
“destination”: “/workspace/tao-experiments”
},
{
“source”: “/home/jakub/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/yolo_v4/specs”,
“destination”: “/workspace/tao-experiments/yolo_v4/specs”
}
]
}
I suspect that this is an issue with docker mapping but as a beginner I cant seem to find a solution.
Any help would be greatly appreciated.