TAO yolo_v3 google colab training failure

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Tesla T4 in Google Colab
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Yolo V3 ResNet18
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here)
• Training spec file(If have, please share here)
yolo_v3_train_resnet18_tfrecord.txt (1.9 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

When I run the TAO train command to train the yolo_v3 ResNet18 model:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao model yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_tfrecord.txt \
                         -r $EXPERIMENT_DIR/experiment_dir_unpruned \
                         -k $KEY \
                         --gpus 1

I get the error below:

Epoch 1/10
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 164, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 160, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 142, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 94, in run_experiment
    model.train(verbose)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/models/yolov3_model.py", line 646, in train
    self.keras_model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in fit
    return training_arrays.fit_loop(self, f, ins,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1470, in __call__
    ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_4391}} /content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images//content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/img_774.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
[[SparseSplit/_7625]]
(1) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_4391}} /content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images//content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/img_774.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
0 successful operations.
0 derived errors ignored.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL

The path is not correct. Please double-check it, especially in the tao_mounts.json file.

In the Google Colab notebook provided by NVIDIA, TAO yolo_v3 uses the KITTI data by default.
I have mounted my own dataset for model training, but somehow the dataset is not being identified properly.

After running the training command:
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao model yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_tfrecord.txt \
                         -r $EXPERIMENT_DIR/experiment_dir_unpruned \
                         -k $KEY \
                         --gpus 1

I get this error:

/usr/local/lib/python3.8/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

Epoch 1/20
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL

Did you fix the above error? It seems that "/content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/" appears twice in the failing file path.
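
The doubled prefix in the log is consistent with an image root directory being prepended to an entry that already holds an absolute path. Here is a minimal Python sketch for spotting such paths before training, assuming that is the cause. `find_doubled_prefix` is a hypothetical helper for illustration, not part of TAO:

```python
def find_doubled_prefix(path):
    """Return the repeated directory prefix if `path` embeds a second
    absolute path that starts with the same directory, else None."""
    # Plain string concatenation of "root/" + "/absolute/path" leaves a "//"
    # at the point where the second absolute path begins.
    idx = path.find("//", 1)
    if idx == -1:
        return None
    head = path[:idx].rstrip("/")   # directory before the "//"
    tail = path[idx + 1:]           # remainder, keeping one leading "/"
    # Only report when the remainder starts with the same directory again.
    if tail.startswith(head + "/"):
        return head
    return None


# Path copied from the NotFoundError in the training log above.
bad = (
    "/content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/"
    "/content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/img_774.jpg"
)
print(find_doubled_prefix(bad))
```

If the helper reports a prefix, either the configured image directory or the stored file names should be made relative so the directory is only applied once.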


This problem is now solved. I created a new folder, stored the data according to the KITTI format, and updated the file paths.
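
For anyone reorganizing a dataset the same way, a quick check that every image has a same-named label file can catch path problems before training starts. This is a rough sketch only: the `images` and `labels` folder names, the extension list, and the `missing_labels` helper are assumptions for illustration, not something TAO provides.

```python
from pathlib import Path


def missing_labels(image_dir, label_dir, image_exts=(".jpg", ".jpeg", ".png")):
    """List image file names that have no same-named .txt label file."""
    image_dir, label_dir = Path(image_dir), Path(label_dir)
    missing = []
    for img in sorted(image_dir.iterdir()):
        # Skip anything that is not an image (by extension).
        if img.suffix.lower() not in image_exts:
            continue
        # KITTI-style labels share the image's base name with a .txt extension.
        if not (label_dir / (img.stem + ".txt")).is_file():
            missing.append(img.name)
    return missing


# Example call (hypothetical paths; adjust to your own dataset folder):
# print(missing_labels("/content/drive/MyDrive/my_kitti_data/images",
#                      "/content/drive/MyDrive/my_kitti_data/labels"))
```

An empty list means every image found has a matching label file.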