Resume TLT3 yolov4 from weights

Question: my TLT3 training job (YOLOv4) was killed overnight. The Jupyter notebook states: “To resume from checkpoint, please change pretrain_model_path to resume_model_path in config file.”

With this little information, I’m stuck.
Here’s the relevant part of the config file:
pretrain_model_path: "EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"

Do I only need to change pretrain_model_path to resume_model_path?
Or should I replace it with: resume_model_path: "USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_030.tlt"

No matter what I do, I get a “file not found” error. With a giant training set, I’d hate to start all over from scratch.

Gerard

Please comment out pretrain_model_path and then add resume_model_path. For your case,

# pretrain_model_path: "EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
resume_model_path: "USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_030.tlt"
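
The same edit can be scripted. The following is a minimal Python sketch, not part of TLT: the helper name switch_to_resume, the sample spec fragment, and the paths in it are hypothetical and only serve as illustration. It comments out the pretrain_model_path line and adds a resume_model_path line right after it, using plain ASCII double quotes.

```python
import re


def switch_to_resume(spec_text: str, checkpoint_path: str) -> str:
    """Comment out pretrain_model_path and add resume_model_path after it."""
    out = []
    for line in spec_text.splitlines():
        match = re.match(r"^(\s*)pretrain_model_path\s*:", line)
        if match:
            indent = match.group(1)
            # Keep the old setting as a comment so it is easy to restore.
            out.append(f"{indent}# {line.strip()}")
            # Write the new field with straight ASCII double quotes.
            out.append(f'{indent}resume_model_path: "{checkpoint_path}"')
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical spec fragment and paths, for illustration only.
spec = '''training_config {
  num_epochs: 80
  pretrain_model_path: "/workspace/pretrained/resnet_18.hdf5"
}
'''
print(switch_to_resume(spec, "/workspace/weights/yolov4_resnet18_epoch_030.tlt"))
```

Lines that are already commented out are left alone, because the pattern only matches lines that start with the field name.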

Reference: YOLOv4 — Transfer Learning Toolkit 3.0 documentation

The process was killed again, and I can’t get it working again.

Here are the two lines in the YOLO training config file:

#pretrain_model_path: “EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5”
resume_model_path: “USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_050.tlt”

And here’s the traceback:
Traceback (most recent call last):
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 209, in <module>
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 205, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 56, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py”, line 20, in load_experiment_spec
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 737, in Merge
allow_unknown_field=allow_unknown_field)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 805, in MergeLines
return parser.MergeLines(lines, message)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 830, in MergeLines
self._ParseOrMerge(lines, message)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 852, in _ParseOrMerge
self._MergeField(tokenizer, message)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 980, in _MergeField
merger(tokenizer, message, field)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 1054, in _MergeMessageField
self._MergeField(tokenizer, sub_message)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 980, in _MergeField
merger(tokenizer, message, field)
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 1105, in _MergeScalarField
value = tokenizer.ConsumeString()
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 1476, in ConsumeString
the_bytes = self.ConsumeByteString()
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 1491, in ConsumeByteString
the_list = [self._ConsumeSingleByteString()]
File “/usr/local/lib/python3.6/dist-packages/google/protobuf/text_format.py”, line 1510, in _ConsumeSingleByteString
raise self.ParseError(‘Expected string but found: %r’ % (text,))
google.protobuf.text_format.ParseError: 46:20 : ‘resume_model_path: “USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_050.tlt”’: Expected string but found: ‘“’
Traceback (most recent call last):
File “/usr/local/bin/yolo_v4”, line 8, in <module>
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py”, line 12, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 296, in launch_job
AssertionError: Process run failed.

Is the file at the path above available?

!ls -l $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights
total 1187460
-rw-r--r-- 1 root root 243190472 Apr 27 13:39 yolov4_resnet18_epoch_010.tlt
-rw-r--r-- 1 root root 243190472 Apr 27 18:21 yolov4_resnet18_epoch_020.tlt
-rw-r--r-- 1 root root 243190472 Apr 27 23:04 yolov4_resnet18_epoch_030.tlt
-rw-r--r-- 1 root root 243190472 Apr 28 03:46 yolov4_resnet18_epoch_040.tlt
-rw-r--r-- 1 root root 243190472 Apr 28 08:32 yolov4_resnet18_epoch_050.tlt
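
The listing shows one checkpoint saved every 10 epochs, so the file to resume from is the one with the highest epoch number. As a rough Python sketch (the helper name latest_checkpoint is hypothetical, and the _epoch_NNN.tlt naming pattern is assumed from the listing above), the newest checkpoint could be picked like this:

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed naming pattern, taken from the listing above: ..._epoch_NNN.tlt
EPOCH_RE = re.compile(r"_epoch_(\d+)\.tlt$")


def latest_checkpoint(weights_dir: str) -> Optional[Path]:
    """Return the .tlt file with the highest epoch number, or None."""
    best_path, best_epoch = None, -1
    for path in Path(weights_dir).glob("*.tlt"):
        match = EPOCH_RE.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path


# Demo on a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (10, 20, 30, 40, 50):
        (Path(tmp) / f"yolov4_resnet18_epoch_{epoch:03d}.tlt").touch()
    print(latest_checkpoint(tmp).name)
```

Comparing the parsed epoch numbers rather than sorting file names avoids surprises if the zero padding ever changes.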

Please set an absolute path and retry.

Also, please note that the path must be a path inside the Docker container, not one on your host PC.
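
Both points can be checked before launching a long training run. The following is a small Python sketch, under the assumption that it is run inside the container (for example from the notebook), since a path that exists on the host may not exist there. The helper name check_resume_path is hypothetical.

```python
import os
from typing import List


def check_resume_path(path: str) -> List[str]:
    """Return a list of problems found with a checkpoint path (empty if none)."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    elif not os.path.isfile(path):
        # Only meaningful when run in the environment that does the training.
        problems.append("no file at this path in the current filesystem")
    return problems


# The value from the config above is relative, so it is flagged.
print(check_resume_path(
    "USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_050.tlt"
))
```

An empty list means the path is absolute and points to an existing file in the environment where the check ran.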

It must be an absolute beginner’s mistake, but I mixed up “ with ". I don’t know if it makes any difference. I ended up using: resume_model_path: "/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet18_epoch_050.tlt"

Now it purrs like a kitten again. Thanks!
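
For anyone hitting the same ParseError: the message “Expected string but found: ‘“’” in the traceback points at a typographic (curly) quote where the parser expects a plain ASCII one, which can happen when a config line is copied from a rendered web page. The following Python sketch, with hypothetical helper names, locates such characters in a spec file's text and replaces them with straight quotes.

```python
from typing import List, Tuple

CURLY_QUOTES = "\u201c\u201d\u2018\u2019"

# Map typographic quotes to their plain ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def find_curly_quotes(text: str) -> List[Tuple[int, int]]:
    """Return 1-based (line, column) positions of every curly quote."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char in CURLY_QUOTES:
                hits.append((line_no, col))
    return hits


def straighten_quotes(text: str) -> str:
    """Replace every curly quote, including any inside comments."""
    return text.translate(CURLY_TO_STRAIGHT)


sample = (
    "resume_model_path: \u201c/workspace/tlt-experiments/yolo_v4/"
    "experiment_dir_unpruned/weights/yolov4_resnet18_epoch_050.tlt\u201d"
)
print(find_curly_quotes(sample))
print(straighten_quotes(sample))
```

Typing the quotes by hand in a plain-text editor works just as well; the script is mainly useful for checking a larger spec file in one pass.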