Training EmotionNet with TAO Toolkit through Jupyter Notebook

Hello,

I’m currently following the tutorial Jupyter Notebook to train and use the EmotionNet model with the CK+ dataset, but when I run:

!tao emotionnet train -e $SPECS_DIR/emotionnet_tlt_pretrain.yaml \
                      -r $USER_EXPERIMENT_DIR/experiment_result/exp1 \
                      -k $KEY

I get this error:

2022-11-30 16:57:04,218 [INFO] root: Registry: ['nvcr.io']
2022-11-30 16:57:04,264 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-11-30 16:57:04,276 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2022-11-30 15:57:05.068300: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-11-30 15:57:06,920 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

2022-11-30 15:57:08,797 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py:88: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

2022-11-30 15:57:08,797 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py:88: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py:88: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2022-11-30 15:57:08,797 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py:88: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

/usr/local/lib/python3.6/dist-packages/driveix/emotionnet/scripts/train.py:118: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
/workspace/tao-experiments/emotionnet/experiment_result/exp1
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/dataloader/emotionnet_dataloader.py:269: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING 2022-11-30 15:57:09,258| tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/dataloader/emotionnet_dataloader.py:269: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING 2022-11-30 15:57:09,264| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING 2022-11-30 15:57:09,306| tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.processors.parse_example_proto.ParseExampleProto object at 0x7ff81106a470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.processors.parse_example_proto.ParseExampleProto object at 0x7ff81106a470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING 2022-11-30 15:57:09,338| tensorflow: Entity <bound method Processor.__call__ of <modulus.processors.parse_example_proto.ParseExampleProto object at 0x7ff81106a470>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.processors.parse_example_proto.ParseExampleProto object at 0x7ff81106a470>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
Phase: training, num_samples: 884 
 
INFO    2022-11-30 15:57:09,625| /usr/local/lib/python3.6/dist-packages/driveix/emotionnet/trainers/emotionnet_trainer.pyc: steps_per_epoch: 13
INFO    2022-11-30 15:57:09,625| /usr/local/lib/python3.6/dist-packages/driveix/emotionnet/trainers/emotionnet_trainer.pyc: last_step: 650
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING 2022-11-30 15:57:09,628| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4185: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

WARNING 2022-11-30 15:57:09,633| tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4185: The name tf.truncated_normal is deprecated. Please use tf.random.truncated_normal instead.

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/utilities/tlt_utils.py", line 150, in decode_to_keras
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 417, in load_model
    f = h5dict(filepath, 'r')
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 186, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py", line 155, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/scripts/train.py", line 144, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/trainers/emotionnet_trainer.py", line 174, in build
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/models/emotionnet_model.py", line 149, in build
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/utilities/tlt_utils.py", line 190, in model_io
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/utilities/tlt_utils.py", line 153, in decode_to_keras
OSError: Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect.
Traceback (most recent call last):
  File "/usr/local/bin/emotionnet", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/emotionnet/entrypoint/emotionnet.py", line 13, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/common/entrypoint/entrypoint.py", line 300, in launch_job
AssertionError: Process run failed.
2022-11-30 16:57:10,514 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I looked through the forum and tried some of the solutions I found there, but I haven’t found how to fix it.

I’m using a regular desktop computer with a GTX 1050 Ti for the EmotionNet network. The tlt command isn’t found in my terminal.

Thanks!

Please set the correct key.
If you are using an NGC pretrained model, you can find the key on the model card page on NGC.

What do you mean? Are you talking about the NVIDIA API key, or is there a specific key for each model?

OK, never mind, I found the key you are talking about, but where am I supposed to add it?

Add the key on the command line:
-k key
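
As a minimal sketch of what this looks like from the notebook, the snippet below assembles the same training command with the -e, -r, and -k flags used in the original post. The helper name build_train_command, the specs path, and the key value are hypothetical placeholders; the real key is the one listed on the model card of the pretrained model.

```python
import shlex


def build_train_command(specs_dir, experiment_dir, key):
    """Return the `tao emotionnet train` command as a list of arguments.

    Only the flags that appear in the original post (-e, -r, -k) are used.
    """
    if not key:
        raise ValueError("The model key is empty; set it before training.")
    return [
        "tao", "emotionnet", "train",
        "-e", f"{specs_dir}/emotionnet_tlt_pretrain.yaml",
        "-r", f"{experiment_dir}/experiment_result/exp1",
        "-k", key,
    ]


# Placeholder values for illustration only: replace them with your own paths
# and with the key from the model card.
command = build_train_command(
    specs_dir="/workspace/specs",  # hypothetical path
    experiment_dir="/workspace/tao-experiments/emotionnet",
    key="KEY_FROM_MODEL_CARD",  # hypothetical placeholder, not a real key
)
print(" ".join(shlex.quote(part) for part in command))
```

If the key passed with -k does not match the one the pretrained model was encrypted with, loading fails with the "Invalid decryption" error shown in the traceback above.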

OK, I added it, thank you! Now it seems to run, but the processing time is really short, about 2 minutes, and it doesn’t improve the model, while the Jupyter Notebook says training can take several hours… It appears to take only a few images from the dataset…

code (112.1 KB)

Could you please double-check the dataset?

I already checked it, and the Jupyter Notebook I use is made to work with this one. I’m currently checking whether all the packages I use are up to date and properly installed in my virtual env, and then we will see!

I don’t know why, but it seems that the notebook in my venv uses a Python 3.6 library while I only have Python 3.8 installed. I cannot find out where it finds the older version…

Please check if the model.tlt is already available in your training result folder.
If yes, the training is successful.

Yes, a model.tlt file is available, but some things make me think it just skips the training phase.

First it shows me this:

Phase: training, num_samples: 884 
 
INFO    2022-12-05 08:51:28,409| /usr/local/lib/python3.6/dist-packages/driveix/emotionnet/trainers/emotionnet_trainer.pyc: steps_per_epoch: 13
INFO    2022-12-05 08:51:28,409| /usr/local/lib/python3.6/dist-packages/driveix/emotionnet/trainers/emotionnet_trainer.pyc: last_step: 650

but then a few lines later it shows me this:

UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.

So I’m surely missing something…

That warning does not matter.

Could you use a new result folder and run training again?

Yes, I tried. Each time it creates a new result folder (exp1, exp2, exp3, …), but the processing time and the results printed in “eval_result.txt”:

Evaluation step 0, loss: 7.283317565917969.
================ =========== ======== ========= ============
                 precision   recall   f_score   numsamples
================ =========== ======== ========= ============
angry            0           0        0         7
contempt         0           0        0         5
disgust          0.64286     0.81818  0.72      11
happy            0.92857     1        0.96296   13
neutral          0.83721     0.81818  0.82759   44
surprise         1           1        1         14
Average          0.56811     0.60606  0.58509   94
Weighted_average 0.74447     0.76596  0.75375   94
================ =========== ======== ========= ============
============ === === === === === ===
             0   1   2   3   4   5
============ === === === === === ===
neutral (0)  36  1   0   7   0   0
happy (1)    0   13  0   0   0   0
surprise (2) 0   0   14  0   0   0
contempt (3) 5   0   0   0   0   0
disgust (4)  0   0   0   2   9   0
angry (5)    2   0   0   0   5   0
============ === === === === === ===

make me think that not all the images are used… I don’t know…

Can you save the .ipynb file as an .html file and upload it here? Thanks.

This one?
emotionnet - Jupyter Notebook.html (1.2 MB)

The training part is in part 5.

I get a similar result to yours. There is no issue in the training.
Please train for more epochs.
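
For reference, the number of epochs behind the run above can be read off the step counts in the training log. This is a minimal sketch using only the values printed in the log (num_samples: 884, steps_per_epoch: 13, last_step: 650). How the trainer rounds the last partial batch is an assumption, so both possibilities are listed.

```python
# Values copied from the training log quoted earlier in the thread.
num_samples = 884
steps_per_epoch = 13
last_step = 650

# Total number of epochs the trainer runs before stopping at last_step.
epochs = last_step // steps_per_epoch

# Batch sizes consistent with 13 steps per epoch, depending on whether the
# final partial batch is dropped (floor) or kept (ceil). Which rule the
# trainer applies is an assumption, not something the log states.
floor_batches = [b for b in range(1, num_samples + 1)
                 if num_samples // b == steps_per_epoch]
ceil_batches = [b for b in range(1, num_samples + 1)
                if -(-num_samples // b) == steps_per_epoch]

print(f"epochs: {epochs}")
print(f"batch sizes if the partial batch is dropped: {floor_batches}")
print(f"batch sizes if the partial batch is kept: {ceil_batches}")
```

With 884 training images, the whole run amounts to 650 optimizer steps, so a run that finishes in minutes rather than hours is plausible for a dataset of this size.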

Hmm, OK, I tried that and the processing time still seems really short for the size of the dataset, but I will figure it out.
Also, I don’t understand this result:

Evaluation step 2587, loss: 7.091771125793457.
================ =========== ======== ========= ============
                 precision   recall   f_score   numsamples
================ =========== ======== ========= ============
angry            0           0        0         16
contempt         0           0        0         3
disgust          0.33333     1        0.5       6
happy            1           1        1         14
neutral          0.88889     1        0.94118   40
surprise         1           1        1         14
Average          0.53704     0.66667  0.57353   93
Weighted_average 0.7049      0.7957   0.73814   93
================ =========== ======== ========= ============
============ === === === === === ===
             0   1   2   3   4   5
============ === === === === === ===
neutral (0)  40  0   0   0   0   0
happy (1)    0   14  0   0   0   0
surprise (2) 0   0   14  0   0   0
contempt (3) 3   0   0   0   0   0
disgust (4)  0   0   0   0   6   0
angry (5)    2   0   0   2   12  0
============ === === === === === ===

How can the precision and recall of “angry” and “contempt” be 0 from beginning to end, when I start from the NVIDIA pretrained EmotionNet model and train it further with more data covering all 6 emotions?
Even if my training was bad, it should have a value other than 0, at least at the beginning.

It is normal, because the “angry” class has 14 images and “contempt” has only 2 images.
From the default training result,
$ cat eval_results.txt

The two “contempt” images are classified as “neutral”.
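
To make the zeros concrete, here is a small sketch that recomputes precision and recall from the confusion matrix posted for evaluation step 2587. It assumes that rows are ground-truth classes and columns are predicted classes, which is the reading that reproduces the reported values. The function name per_class_metrics is only illustrative.

```python
# Confusion matrix copied from the evaluation at step 2587.
# Assumption: rows are ground-truth classes, columns are predicted classes.
labels = ["neutral", "happy", "surprise", "contempt", "disgust", "angry"]
matrix = [
    [40, 0, 0, 0, 0, 0],
    [0, 14, 0, 0, 0, 0],
    [0, 0, 14, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 6, 0],
    [2, 0, 0, 2, 12, 0],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall)}, using 0.0 when undefined."""
    metrics = {}
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        metrics[i] = (precision, recall)
    return metrics


for index, (precision, recall) in per_class_metrics(matrix).items():
    print(f"{labels[index]:<9} precision={precision:.5f} recall={recall:.5f}")
```

Under that reading, none of the 16 “angry” samples is predicted as “angry” (12 of them go to “disgust”), and all 3 “contempt” samples go to “neutral”. The diagonal entries for both classes are 0, so their precision and recall are 0 no matter how well the other classes do.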

Oh OK, so my dataset isn’t set up correctly, because it is supposed to contain something like 1200 images while numsamples only shows 279 samples, right?

In my training log:
Phase: training, num_samples: 884
Phase: validation, num_samples: 93

You can check your training log.
For further experiments, you can add more training images or test images.
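
One way to check these counts is to pull them out of a saved training log. This is a minimal sketch that assumes the log lines have exactly the "Phase: ..., num_samples: ..." format quoted above; the function name sample_counts is hypothetical.

```python
import re

# Matches lines such as "Phase: training, num_samples: 884".
PHASE_PATTERN = re.compile(r"Phase:\s*(\w+),\s*num_samples:\s*(\d+)")


def sample_counts(log_text):
    """Return {phase: num_samples} for every phase line found in the log."""
    return {phase: int(count) for phase, count in PHASE_PATTERN.findall(log_text)}


# The two lines quoted above; in practice, read the saved log file instead.
example_log = """\
Phase: training, num_samples: 884
Phase: validation, num_samples: 93
"""

counts = sample_counts(example_log)
print(counts)
print("total:", sum(counts.values()))
```

Comparing the total against the number of images expected in the dataset shows whether some images were dropped before training.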