• Hardware: AWS EC2 g4dn.xlarge
• Network Type: peoplenet_vtrainable_v2.5 resnet34_peoplenet.tlt
• TLT Version: TAO version 5
• Training spec file: peoplenet34_heads.txt (3.1 KB)
• How to reproduce the issue:
Run with Python 3.8 in a Jupyter notebook:
tao model detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
    -n resnet18_detector \
    --gpus $NUM_GPUS \
    -k tlt_encode
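• Error message (the log ends with "Execution status: FAIL" right after the first evaluation starts; no traceback is printed):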
2023-09-28 15:04:23,438 [TAO Toolkit] [INFO] tensorflow 692: global_step/sec: 1.78093
2023-09-28 15:04:26,926 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 7.085
INFO:tensorflow:epoch = 0.9122137404580152, learning_rate = 0.00049999997, loss = 0.013414521, step = 478 (5.830 sec)
2023-09-28 15:04:29,266 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9122137404580152, learning_rate = 0.00049999997, loss = 0.013414521, step = 478 (5.830 sec)
INFO:tensorflow:epoch = 0.933206106870229, learning_rate = 0.00049999997, loss = 0.016209295, step = 489 (6.025 sec)
2023-09-28 15:04:35,291 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.933206106870229, learning_rate = 0.00049999997, loss = 0.016209295, step = 489 (6.025 sec)
INFO:tensorflow:epoch = 0.9522900763358778, learning_rate = 0.00049999997, loss = 0.014432838, step = 499 (5.672 sec)
2023-09-28 15:04:40,964 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9522900763358778, learning_rate = 0.00049999997, loss = 0.014432838, step = 499 (5.672 sec)
2023-09-28 15:04:40,964 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 7.124
INFO:tensorflow:epoch = 0.9713740458015266, learning_rate = 0.00049999997, loss = 0.014740716, step = 509 (5.692 sec)
2023-09-28 15:04:46,656 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9713740458015266, learning_rate = 0.00049999997, loss = 0.014740716, step = 509 (5.692 sec)
INFO:tensorflow:epoch = 0.9904580152671756, learning_rate = 0.00049999997, loss = 0.016124992, step = 519 (5.687 sec)
2023-09-28 15:04:52,343 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9904580152671756, learning_rate = 0.00049999997, loss = 0.016124992, step = 519 (5.687 sec)
INFO:tensorflow:global_step/sec: 1.76236
2023-09-28 15:04:52,943 [TAO Toolkit] [INFO] tensorflow 692: global_step/sec: 1.76236
[1695913495.912078] [0ac105827284:216 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument
2023-09-28 15:04:56,003 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.evaluation.evaluation 130: step 0 / 58, 0.00s/step
Execution status: FAIL
2023-09-28 15:05:08,039 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.
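To make it easier to see where the run stops, here is a minimal sketch in plain Python (not part of TAO) that scans a log like the one above for the last training step and the line just before the failure. The `summarize` helper and its regex are my own hypothetical names, and the pattern assumes the log format shown in this post.

```python
import re

# Three lines copied from the log in this post.
LOG = """\
2023-09-28 15:04:52,343 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9904580152671756, learning_rate = 0.00049999997, loss = 0.016124992, step = 519 (5.687 sec)
2023-09-28 15:04:56,003 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.evaluation.evaluation 130: step 0 / 58, 0.00s/step
Execution status: FAIL
"""

# Assumed format of the TensorFlow training progress lines seen above.
STEP_RE = re.compile(r"epoch = ([\d.]+), .*loss = ([\d.eE+-]+), step = (\d+)")


def summarize(log_text):
    """Return the last training step seen and the line preceding the FAIL marker."""
    last_train = None
    previous_line = None
    for line in log_text.splitlines():
        if line.startswith("Execution status: FAIL"):
            return {
                "failed": True,
                "last_train": last_train,
                "line_before_fail": previous_line,
            }
        match = STEP_RE.search(line)
        if match:
            last_train = {
                "epoch": float(match.group(1)),
                "loss": float(match.group(2)),
                "step": int(match.group(3)),
            }
        previous_line = line
    return {"failed": False, "last_train": last_train, "line_before_fail": None}


summary = summarize(LOG)
print(summary)
```

On the excerpt above, this reports that the last training line was step 519 (epoch about 0.99) and that the line right before the failure comes from the evaluation module, which is what makes me think the crash happens when the first evaluation starts.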
Thank you to anyone who can help :)