TLT v3 fails to train PeopleNet model from ngc

dimakan · June 28, 2021, 1:18pm

Hi
I’m able to run detectnet_v2 without any issue from within the container (without the tlt wrapper).
I tried to follow the tutorial below to train the PeopleNet model. We need a person detector and have a relatively small amount of data, so we prefer to start from PeopleNet rather than detectnet_v2.
https://developer.nvidia.com/blog/training-custom-pretrained-models-using-tlt/
For detectnet_v2, ngc has resnet34.hdf5, which is in HDF5 format, but for PeopleNet the model is a .tlt file: tlt_peoplenet_vunpruned_v2.1/resnet34_peoplenet.tlt. I tried v2.1 and v2.0, unpruned and pruned, and they all fail the same way.
Command for training inside the container (spec file attached):
detectnet_v2 train -e /workspace/detectnet_v2/specs/peoplenet_train_resnet34_person_kitti.txt -r /workspace/data/detectnet_v2/experiment_dir_unpruned_resnet34_person -k tlt_encode -n peoplenet_resnet34_detector --gpus 1

I also tried adding “load_graph: true” to model_config, as in the retrain config, which accepts .tlt, but no luck: I got other errors.

tlt.log (62.8 KB)
peoplenet_train_resnet34_person_kitti.txt (2.8 KB)

carolinainmymind · June 29, 2021, 5:11pm

Hey, I had this issue, and it was solved when I deleted the output folder and its contents. Leftover files from past training attempts were causing it in my case.

kayccc · June 29, 2021, 11:19pm

Moving into Transfer Learning Toolkit forum for resolution.

Morganh · June 30, 2021, 2:26am

@dimakan
Please remove the result folder and try again.
Reference: Problem with training peoplenetv2 - #5
DetectNet v2 training error - "ValueError: The zipfile extracted was corrupt. Please check your key "
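
For anyone landing here, a minimal sketch of the cleanup step, assuming the results folder is the one passed to -r in the training command above. clean_results_dir is a hypothetical helper, not part of TLT. It prints the path before deleting it, so it is easier to notice when the wrong folder is being removed.

```shell
# Results folder from the original post (the -r argument); adjust to your setup.
RESULTS_DIR="${RESULTS_DIR:-/workspace/data/detectnet_v2/experiment_dir_unpruned_resnet34_person}"

# Hypothetical helper (not a TLT command): empty a results folder before training.
clean_results_dir() {
    dir="$1"
    # Refuse obviously dangerous targets.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "refusing to remove '$dir'" >&2
        return 1
    fi
    # Show exactly which folder is about to be removed.
    echo "Removing leftover training output in: $dir"
    rm -rf -- "$dir"
    # Recreate it empty so the path still exists for the next run.
    mkdir -p -- "$dir"
}

# Usage (inside the container), followed by the same train command as above:
#   clean_results_dir "$RESULTS_DIR"
```

Check that the printed path matches the -r value in your train command before re-running training.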

dimakan · June 30, 2021, 3:30am

Hi Morgan,
As you can see in the screenshot of the error, I did rm the folder.
BUT, it was the wrong one!
No idea how I missed it, as I’m aware of this TLT limitation :(

Thank you and Carolina!

system · August 29, 2021, 4:50am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.