TAO yolo_v3 training failure in Google Colab

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Tesla T4 in Google Colab
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Yolo V3 ResNet18
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
yolo_v3_train_resnet18_tfrecord.txt (1.9 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

When I execute the TAO train command to train the TAO yolo_v3 ResNet18 model:

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao model yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_tfrecord.txt \
                         -r $EXPERIMENT_DIR/experiment_dir_unpruned \
                         -k $KEY \
                         --gpus 1

I get the following error:

Epoch 1/10
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 164, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 160, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 142, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/scripts/train.py", line 94, in run_experiment
    model.train(verbose)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v3/models/yolov3_model.py", line 646, in train
    self.keras_model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in fit
    return training_arrays.fit_loop(self, f, ins,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1470, in __call__
    ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_4391}} /content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images//content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/img_774.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
[[SparseSplit/_7625]]
(1) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_4391}} /content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images//content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/img_774.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
0 successful operations.
0 derived errors ignored.
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL

The path is not correct. Please double-check it, especially in the tao_mounts.json file.

In the Google Colab notebook provided by NVIDIA, TAO yolo_v3 uses the KITTI dataset by default.
I have mounted my own dataset for model training, but somehow the dataset is not being picked up properly.

After running the training command:
print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao model yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_tfrecord.txt \
                         -r $EXPERIMENT_DIR/experiment_dir_unpruned \
                         -k $KEY \
                         --gpus 1

I get this error:

/usr/local/lib/python3.8/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

Epoch 1/20
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL

Did you fix the above error? It seems that "/content/drive/MyDrive/cable_damage_yolov8_dataset/train/rename_and_save/images/" appears twice in the failing path.


This problem is now solved. I created a new folder, stored the data according to the KITTI format, and updated the file paths.
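
A quick way to sanity-check such a folder before training is to confirm that every image has a label file with the same base name, as the KITTI layout expects. The sketch below is only an illustration: the function find_unpaired and the example directory paths are hypothetical, and the set of image extensions is an assumption.

```python
import os

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def find_unpaired(image_dir, label_dir):
    """Return (images without a label, labels without an image) by base name."""
    images = {
        os.path.splitext(name)[0]
        for name in os.listdir(image_dir)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS
    }
    labels = {
        os.path.splitext(name)[0]
        for name in os.listdir(label_dir)
        if name.lower().endswith(".txt")
    }
    return sorted(images - labels), sorted(labels - images)


# Hypothetical paths; replace them with the folders actually used in the spec file.
image_dir = "/content/drive/MyDrive/my_kitti_dataset/train/images"
label_dir = "/content/drive/MyDrive/my_kitti_dataset/train/labels"

if os.path.isdir(image_dir) and os.path.isdir(label_dir):
    missing_labels, missing_images = find_unpaired(image_dir, label_dir)
    print("Images without labels:", missing_labels)
    print("Labels without images:", missing_images)
```

Two empty lists mean the image and label folders are consistent with each other; the paths in the spec file still need to point at those same folders.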
