When I retrained the TrafficCamNet model on my own custom dataset (converted to TFRecords before feeding it to the TLT toolkit, of course), the training stage went well and the log displayed the epochs and steps as expected. However, an error occurred at the validation/evaluation stage:
2021-04-28 06:51:33,395 [INFO] tensorflow: global_step/sec: 1.73384
INFO:tensorflow:epoch = 0.9116279069767441, loss = 0.04012449, step = 196 (5.852 sec)
2021-04-28 06:51:37,409 [INFO] tensorflow: epoch = 0.9116279069767441, loss = 0.04012449, step = 196 (5.852 sec)
2021-04-28 06:51:39,116 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 69.516
INFO:tensorflow:epoch = 0.958139534883721, loss = 0.039332416, step = 206 (5.711 sec)
2021-04-28 06:51:43,120 [INFO] tensorflow: epoch = 0.958139534883721, loss = 0.039332416, step = 206 (5.711 sec)
INFO:tensorflow:global_step/sec: 1.7304
2021-04-28 06:51:45,531 [INFO] tensorflow: global_step/sec: 1.7304
89f5a48d3120:63:107 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
89f5a48d3120:63:107 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
89f5a48d3120:63:107 [0] NCCL INFO NET/IB : No device found.
89f5a48d3120:63:107 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
89f5a48d3120:63:107 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
89f5a48d3120:63:107 [0] NCCL INFO Channel 00/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 01/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 02/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 03/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 04/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 05/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 06/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 07/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 08/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 09/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 10/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 11/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 12/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 13/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 14/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 15/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 16/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 17/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 18/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 19/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 20/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 21/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 22/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 23/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 24/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 25/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 26/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 27/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 28/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 29/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 30/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Channel 31/32 : 0
89f5a48d3120:63:107 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/
89f5a48d3120:63:107 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
89f5a48d3120:63:107 [0] NCCL INFO comm 0x7fac9438de40 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
2021-04-28 06:51:48,063 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 37, 0.00s/step
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 797, in
File "", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 790, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 149, in run_training_loop
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/validation_hook.py", line 79, in after_run
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/validation_hook.py", line 85, in validate
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/evaluation/evaluation.py", line 165, in evaluate
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/postprocessor/postprocessing.py", line 146, in cluster_predictions
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/postprocessor/cluster.py", line 45, in cluster_predictions
AssertionError
Traceback (most recent call last):
File "/usr/local/bin/detectnet_v2", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-28 14:52:04,113 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
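
The log contains two tracebacks. The first ends inside `cluster_predictions` in `postprocessor/cluster.py` (line 45), while the second appears to be only the launcher reporting that the process failed. To pull the innermost frame of the first traceback out of a long log like this, a small helper along the following lines may be useful. This is a minimal sketch: the function name `innermost_frame` is hypothetical and not part of TLT, and it assumes the log has been saved as plain text.

```python
import re

# Matches traceback frame lines such as:
#   File "/path/to/cluster.py", line 45, in cluster_predictions
# Straight and typographic quotes are both accepted, because forum software
# often converts the former into the latter. The function name may be empty
# if the page stripped it (for example "<module>").
FRAME_RE = re.compile(
    r'File\s+["“](?P<path>[^"”]*)["”],\s+line\s+(?P<line>\d+),\s+in\s*(?P<func>\S*)'
)


def innermost_frame(log_text):
    """Return (path, line, func) for the last frame of the first traceback.

    Returns None if the text contains no traceback frames.
    """
    in_traceback = False
    last = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("Traceback (most recent call last):"):
            if last is not None:
                break  # a second traceback begins; keep the first one
            in_traceback = True
            continue
        if not in_traceback:
            continue
        match = FRAME_RE.search(line)
        if match:
            last = (match.group("path"), int(match.group("line")), match.group("func"))
    return last
```

Calling `innermost_frame(open("train.log").read())` (with `train.log` standing in for wherever the output was saved) returns the path, line number, and function name of the frame where the first exception was raised.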