TAO GestureNet train does not work properly

• Hardware (Ubuntu 18.04, GeForce RTX 2060 Super)
• Network Type (GestureNet)
• TLT Version (v3.21.11)
• Training spec file (train_spec.json, uploaded)
• How to reproduce the issue? (Run the GestureNet Jupyter notebook without any modification except the project directory path)
train_spec.json (2.7 KB)

Hi.
When I run 'tao gesturenet train' from the Jupyter notebook, training always finishes before it reaches the number of epochs set in train_spec.json. In the log below it stops after epoch 11 of 50.

Can I get some advice on this issue?

Thanks.

!tao gesturenet train -e $SPECS_DIR/train_spec.json -k $KEY

/home/my-desktop/launcher/launcher/lib/python3.6/site-packages/tlt/__init__.py:20: DeprecationWarning:
The nvidia-tlt package will be deprecated soon. Going forward please migrate to using the nvidia-tao package.

warnings.warn(message, DeprecationWarning)
2021-12-09 15:14:11,528 [INFO] root: Registry: ['nvcr.io']
2021-12-09 15:14:11,568 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2021-12-09 15:14:11,577 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/samjuok-desktop/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-12-09 06:14:12.279738: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
/workspace/tao-experiments/gesturenet/model
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-12-09 06:14:16,887 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:42: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-12-09 06:14:16,887 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

2021-12-09 06:14:17,148 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:62: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2021-12-09 06:14:17,148 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/classifynet/trainer/classifynet_trainer.py:62: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2021-12-09 06:14:17,152 [INFO] driveix.classifynet.trainer.classifynet_trainer: Processed dataset (train): 418
2021-12-09 06:14:17,152 [INFO] driveix.classifynet.trainer.classifynet_trainer: Processed dataset (val): 157
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-12-09 06:14:17,158 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-12-09 06:14:17,158 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-12-09 06:14:17,159 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-12-09 06:14:17,174 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-12-09 06:14:17,178 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2021-12-09 06:14:17,855 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-12-09 06:14:18,822 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-12-09 06:14:18,822 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-12-09 06:14:18,823 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-12-09 06:14:18,958 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-12-09 06:14:19,185 [INFO] driveix.classifynet.models.resnet_vanilla: Model loaded successfully: /workspace/tao-experiments/gesturenet/pretrained_models/gesturenet_vtrainable_v1.0/model.tlt
2021-12-09 06:14:19,185 [INFO] /usr/local/lib/python3.6/dist-packages/driveix/classifynet/models/classifynet_model.pyc: Successfully built model: resnet_vanilla
2021-12-09 06:14:19,186 [INFO] driveix.classifynet.trainer.classifynet_trainer: Built model: resnet_vanilla
2021-12-09 06:14:19,186 [INFO] driveix.classifynet.trainer.classifynet_trainer: Build top training loss: categorical_crossentropy
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

2021-12-09 06:14:19,332 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

2021-12-09 06:14:19,335 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-12-09 06:14:19,351 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

2021-12-09 06:14:19,434 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

2021-12-09 06:14:19,435 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

2021-12-09 06:14:19,617 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training top model.
2021-12-09 06:14:19,617 [INFO] driveix.classifynet.trainer.classifynet_trainer: Build fine-tuning loss: categorical_crossentropy
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-12-09 06:14:19,879 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

Epoch 1/50
2/418 […] - ETA: 7:05 - loss: 1.8406 - categorical_accuracy: 0.5000 /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.137394). Check your callbacks.
% delta_t_median)
418/418 [==============================] - 6s 14ms/step - loss: 1.2961 - categorical_accuracy: 0.5550 - val_loss: 0.8772 - val_categorical_accuracy: 0.7898
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-12-09 06:14:26,098 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/callbacks.py:995: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

d9e63718f216:42:68 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
d9e63718f216:42:68 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d9e63718f216:42:68 [0] NCCL INFO NET/IB : No device found.
d9e63718f216:42:68 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
d9e63718f216:42:68 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
d9e63718f216:42:68 [0] NCCL INFO Channel 00/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 01/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 02/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 03/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 04/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 05/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 06/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 07/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 08/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 09/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 10/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 11/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 12/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 13/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 14/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 15/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 16/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 17/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 18/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 19/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 20/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 21/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 22/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 23/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 24/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 25/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 26/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 27/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 28/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 29/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 30/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Channel 31/32 : 0
d9e63718f216:42:68 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d9e63718f216:42:68 [0] NCCL INFO Connected all rings
d9e63718f216:42:68 [0] NCCL INFO Connected all trees
d9e63718f216:42:68 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d9e63718f216:42:68 [0] NCCL INFO comm 0x7f849c326930 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
Epoch 2/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2776 - categorical_accuracy: 0.5550 - val_loss: 0.9632 - val_categorical_accuracy: 0.8025
Epoch 3/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2629 - categorical_accuracy: 0.5574 - val_loss: 0.9095 - val_categorical_accuracy: 0.7898
Epoch 4/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2383 - categorical_accuracy: 0.5837 - val_loss: 0.9990 - val_categorical_accuracy: 0.7898
Epoch 5/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2357 - categorical_accuracy: 0.5622 - val_loss: 0.9682 - val_categorical_accuracy: 0.7962
Epoch 6/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2418 - categorical_accuracy: 0.5646 - val_loss: 0.9799 - val_categorical_accuracy: 0.7898
Epoch 7/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2264 - categorical_accuracy: 0.5694 - val_loss: 0.9783 - val_categorical_accuracy: 0.7834
Epoch 8/50
418/418 [==============================] - 4s 9ms/step - loss: 1.1957 - categorical_accuracy: 0.5861 - val_loss: 0.9327 - val_categorical_accuracy: 0.7834
Epoch 9/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2467 - categorical_accuracy: 0.5933 - val_loss: 0.9586 - val_categorical_accuracy: 0.7962
Epoch 10/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2283 - categorical_accuracy: 0.5694 - val_loss: 0.9970 - val_categorical_accuracy: 0.7898
Epoch 11/50
418/418 [==============================] - 4s 9ms/step - loss: 1.2101 - categorical_accuracy: 0.5694 - val_loss: 0.9748 - val_categorical_accuracy: 0.7962
2021-12-09 06:15:06,968 [INFO] driveix.classifynet.trainer.classifynet_trainer: Finished training full model
2021-12-09 06:15:09,252 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val Loss: 0.974833607673645
2021-12-09 06:15:09,252 [INFO] driveix.classifynet.trainer.classifynet_trainer: Total Val accuracy: 0.7961783409118652
2021-12-09 06:15:09,253 [INFO] __main__: Training finished successfully.
2021-12-09 15:15:10,426 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

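This is due to early_stopping. In your log, val_loss is lowest at epoch 1 (0.8772) and does not improve in any later epoch, so training ends at epoch 11 instead of running all 50.
Could you please share the result folder?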
! ls -rlt $USER_EXPERIMENT_DIR/model/
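
For illustration, here is a minimal Python sketch of how patience-based early stopping would end this run at epoch 11. It assumes a Keras-style rule that monitors val_loss with a patience of 10 and no minimum delta. The actual settings used by the GestureNet trainer are not shown in this thread, so the patience value and the helper name `early_stop_epoch` are assumptions. The val_loss values are copied from the training log above.

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical helper: mimics a patience-based rule where training stops
    once val_loss has failed to improve for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # Ran through every epoch without triggering a stop.


# val_loss for epochs 1..11, taken from the log in this thread.
val_losses = [0.8772, 0.9632, 0.9095, 0.9990, 0.9682, 0.9799,
              0.9783, 0.9327, 0.9586, 0.9970, 0.9748]

print(early_stop_epoch(val_losses, patience=10))
```

Under these assumptions the best val_loss is at epoch 1, the next ten epochs bring no improvement, and the rule fires at epoch 11, which matches where the log ends.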

model.zip (89.1 MB)

Thank you for your reply.
Please refer to the uploaded file.