TLT v3 fails to train PeopleNet model from NGC

I’m able to run detectnet_v2 without any issue from within the container (without the tlt wrapper).
I tried to follow the tutorial below to train the PeopleNet model. We need a person detector and have a relatively small amount of data, so we prefer to start from PeopleNet rather than from detectnet_v2.

For detectnet_v2, NGC has resnet34.hdf5, which is in hdf5 format, but for PeopleNet the model is a .tlt file: tlt_peoplenet_vunpruned_v2.1/resnet34_peoplenet.tlt. I tried v2.1/v2.0 and unpruned/pruned, and all of them fail the same way.
Command for training inside the container (spec file attached):
detectnet_v2 train -e /workspace/detectnet_v2/specs/peoplenet_train_resnet34_person_kitti.txt -r /workspace/data/detectnet_v2/experiment_dir_unpruned_resnet34_person -k tlt_encode -n peoplenet_resnet34_detector --gpus 1

I tried adding “load_graph: true” to model_config, as in the retrain config, which accepts .tlt, but no luck. I got other errors.

tlt.log (62.8 KB)
peoplenet_train_resnet34_person_kitti.txt (2.8 KB)

Hey, I had this issue, and mine was solved when I deleted the output folder and its contents. Leftover trash from past attempts to train the net was causing it in my case.

Moving into Transfer Learning Toolkit forum for resolution.

Please remove the result folder and try again.
Reference: Problem with training peoplenetv2 - #5
DetectNet v2 training error - "ValueError: The zipfile extracted was corrupt. Please check your key "
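
For anyone landing here later, clearing the results folder can be scripted so it is not forgotten between attempts. This is only a minimal sketch: `reset_results_dir` is a hypothetical helper, not part of TLT, and the path in the comment is the one passed with -r in the training command above. It deletes everything under that folder, so double-check the path before running it.

```python
import shutil
from pathlib import Path


def reset_results_dir(results_dir: str) -> Path:
    """Delete results_dir and everything in it, then recreate it empty.

    Hypothetical helper (not a TLT API). Leftover files from earlier
    training attempts are removed so the next run starts clean.
    """
    path = Path(results_dir)
    if path.exists():
        shutil.rmtree(path)  # removes the folder and all of its contents
    path.mkdir(parents=True)  # recreate it empty for the next run
    return path


# Example call, using the -r path from the training command in this thread:
# reset_results_dir(
#     "/workspace/data/detectnet_v2/experiment_dir_unpruned_resnet34_person"
# )
```

Run it right before detectnet_v2 train, and make sure the path is exactly the folder given to -r, not a similarly named one.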

Hi Morgan,
As you can see in the screenshot of the error, I did rm the folder.
BUT, it was the wrong one!
No idea how I missed it, as I’m aware of this tlt limitation :(

Thank you and Carolina!
