Error in TAO Toolkit classification_tf2 train

• Hardware: Tesla P40
• Network Type: Classification
• nvidia-tao version: 5.2.0.1

I am running the classification_tf2 example from v5.1.0, and my command is:

tao model classification_tf2 train -e path/to/spec/bind/mount 

but I am getting this error:

2024-01-22 18:48:31,242 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-22 18:48:31,502 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2024-01-22 18:48:33,158 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-22 13:18:35.096853: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.classification.scripts.train>", line 215, in main
  File "<frozen common.utils>", line 62, in update_results_dir
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 369, in __getitem__
    self._format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 741, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 367, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 465, in _get_node
    self._validate_get(key)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 166, in _validate_get
    self._format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'results_dir' is not in struct
    full_key: train.results_dir
    object_type=dict

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.classification.scripts.train>", line 221, in <module>
  File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Execution status: FAIL
2024-01-22 18:48:54,045 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I have put the key results_dir in the spec file; here it is:

results_dir: '/workspace/'
encryption_key: 'nvidia_tlt'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 64
  num_epochs: 5
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  backbone: 'byom'
  input_width: 227
  input_height: 227
  input_channels: 3
  input_image_depth: 8
  byom_model: '/workspace/tao-experiments/gender_net.tltb'
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test"
  checkpoint: "/workspace/tao-experiments/class_net.tltb"
  top_k: 3
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/class_net.tltb'
  threshold: 0.68
  byom_model_path: '/workspace/tao-experiments/class_net.tltb'

Any idea what’s causing the issue?

Please add the .yaml extension to the spec file.
i.e.,
tao model classification_tf2 train -e path/to/spec/bind/mount/spec.yaml

Hey @Morganh, thanks for the reply. I only used path/to/spec/bind/mount as a placeholder; my actual command does include the extension. I have made sure that the file exists on my file system and that the bind mounts are configured correctly.
FYI, the full command is below:

tao model classification_tf2 train -e /workspace/tao-experiments/classification_tf2/byom_voc/specs/spec.yml 

I cannot reproduce the above error with tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs/spec.yaml at main · NVIDIA/tao_tutorials · GitHub.
To narrow it down, could you try running with that yaml file? The main change is to the model's backbone.

@Morganh I found the error: I was using a .yml file extension; when I renamed the file to .yaml, it worked.
Thanks for the support.
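
For anyone hitting this later, the fix was only the extension (the spec contents were unchanged), i.e. roughly:

mv /workspace/tao-experiments/classification_tf2/byom_voc/specs/spec.yml \
   /workspace/tao-experiments/classification_tf2/byom_voc/specs/spec.yaml
tao model classification_tf2 train -e /workspace/tao-experiments/classification_tf2/byom_voc/specs/spec.yaml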

Great. Thanks for the info.

Getting this error now:

2024-01-23 12:33:59,534 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-23 12:33:59,809 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2024-01-23 12:34:01,571 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-23 07:04:03.484110: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /workspace/train
[1705993455.594802] [2b86799a5f5c:283  :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '���': Invalid argument
Setting up communication with ClearML server.
ClearML task init failed with error ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
Training will still continue.
Starting classification training.
Found 122336 images belonging to 2 classes.
Processing dataset (train): /workspace/tao-experiments/data/split/train
Found 122336 images belonging to 2 classes.
Processing dataset (validation): /workspace/tao-experiments/data/split/val
cannot import name 'InputSpec' from 'keras.engine' (/usr/local/lib/python3.8/dist-packages/keras/engine/__init__.py)
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.classification.scripts.train>", line 217, in main
  File "<frozen common.decorators>", line 88, in _func
  File "<frozen common.decorators>", line 61, in _func
  File "<frozen cv.classification.scripts.train>", line 178, in run_experiment
  File "<frozen cv.classification.model.classifier_module>", line 42, in __init__
  File "<frozen cv.classification.model.classifier_module>", line 82, in _build_models
  File "<frozen cv.classification.model.model_builder>", line 579, in get_model
  File "<frozen cv.classification.model.model_builder>", line 118, in get_byom
  File "<frozen cv.classification.utils.helper>", line 345, in decode_tltb
  File "<frozen cv.classification.utils.helper>", line 322, in deserialize_custom_layers
  File "<string>", line 7, in <module>
ImportError: cannot import name 'InputSpec' from 'keras.engine' (/usr/local/lib/python3.8/dist-packages/keras/engine/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.classification.scripts.train>", line 221, in <module>
  File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Execution status: FAIL
2024-01-23 12:34:38,063 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

@Morganh
I think the model has a custom layer that fails during deserialization. Can you help me with what needs to be done here?

Should be related to the byom model in your spec file.
How did you generate it? Did you follow tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/byom_voc/byom_classification.ipynb at main · NVIDIA/tao_tutorials · GitHub and tao_byom_examples/classification at main · NVIDIA-AI-IOT/tao_byom_examples · GitHub?

Yes, I generated it using the tao_byom converter. I originally had weights in caffemodel format and converted them to ONNX via the caffe2onnx converter from PyPI. I tested the weights in both the caffemodel and converted ONNX formats, and they had the same accuracy. After that, I converted the ONNX weights to tltb.

Please try the steps below to check if it works.
$ docker run -it --rm --net=host --gpus all -v /local/folder:/docker/folder nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0 /bin/bash

Then
$ mv /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/utils/helper.py /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/utils/helper.py.bak

$ vim /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/utils/helper.py (copy from tao_tensorflow2_backend/nvidia_tao_tf2/cv/classification/utils/helper.py at main · NVIDIA/tao_tensorflow2_backend · GitHub)
Then modify

319    source_code = art.get_content()
320    spec = importlib.util.spec_from_loader('helper', loader=None)

to

    source_code = art.get_content()
    bak = source_code.split("\n")
    bak[6] = "from tensorflow.keras.layers import InputSpec"
    bak[682] = "class ZeroPadding1D_NCW(keras.layers.ZeroPadding1D):"
    source_code = "\n".join(bak)
    spec = importlib.util.spec_from_loader('helper', loader=None)
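
For context, the ImportError comes from the custom-layer source that is embedded inside your .tltb: that source does from keras.engine import InputSpec, and keras.engine no longer exposes InputSpec in the Keras version shipped with the 5.0 container. The modification above rewrites that import line (and one class definition) in the embedded source before it is exec'd. As a rough standalone illustration of the import change only:

# fails in the tao-toolkit:5.0.0-tf2.11.0 container
# from keras.engine import InputSpec
# replacement that the patched helper injects into the embedded source
from tensorflow.keras.layers import InputSpec
print(InputSpec(ndim=4))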

@Morganh I am getting the following error when running:

classification_tf2 train -e specs/spec.yaml

Output:

root@jarvis11:~# classification_tf2 train -e specs/spec.yaml
2024-01-23 09:37:42.164668: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /root/train
Setting up communication with ClearML server.
ClearML task init failed with error ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
Training will still continue.
Starting classification training.
Found 122336 images belonging to 2 classes.
Processing dataset (train): /root/data/split/train
Found 122336 images belonging to 2 classes.
Processing dataset (validation): /root/data/split/val
bad marshal data (unknown type code)
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.classification.scripts.train>", line 217, in main
  File "<frozen common.decorators>", line 88, in _func
  File "<frozen common.decorators>", line 61, in _func
  File "<frozen cv.classification.scripts.train>", line 178, in run_experiment
  File "<frozen cv.classification.model.classifier_module>", line 42, in __init__
  File "<frozen cv.classification.model.classifier_module>", line 82, in _build_models
  File "<frozen cv.classification.model.model_builder>", line 579, in get_model
  File "<frozen cv.classification.model.model_builder>", line 118, in get_byom
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/utils/helper.py", line 360, in decode_tltb
    model = keras.models.model_from_config(m, custom_objects=EFF_CUSTOM_OBJS)
  File "/usr/local/lib/python3.8/dist-packages/keras/saving/legacy/model_config.py", line 55, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "/usr/local/lib/python3.8/dist-packages/keras/layers/serialization.py", line 252, in deserialize
    return serialization.deserialize_keras_object(
  File "/usr/local/lib/python3.8/dist-packages/keras/saving/legacy/serialization.py", line 517, in deserialize_keras_object
    deserialized_obj = cls.from_config(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 3114, in from_config
    inputs, outputs, layers = functional.reconstruct_from_config(
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 1470, in reconstruct_from_config
    process_layer(layer_data)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 1451, in process_layer
    layer = deserialize_layer(layer_data, custom_objects=custom_objects)
  File "/usr/local/lib/python3.8/dist-packages/keras/layers/serialization.py", line 252, in deserialize
    return serialization.deserialize_keras_object(
  File "/usr/local/lib/python3.8/dist-packages/keras/saving/legacy/serialization.py", line 517, in deserialize_keras_object
    deserialized_obj = cls.from_config(
  File "/usr/local/lib/python3.8/dist-packages/keras/layers/core/lambda_layer.py", line 324, in from_config
    function = cls._parse_function_from_config(
  File "/usr/local/lib/python3.8/dist-packages/keras/layers/core/lambda_layer.py", line 391, in _parse_function_from_config
    function = generic_utils.func_load(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/generic_utils.py", line 103, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.classification.scripts.train>", line 221, in <module>
  File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Execution status: FAIL

Could you double-check the above-mentioned steps? I cannot reproduce the latest error.

Here is the original part after copying from tao_tensorflow2_backend:

309 def deserialize_custom_layers(art):
310     """Deserialize the code for custom layer from EFF.
311
312     Args:
313         art (eff.core.artifact.Artifact): Artifact restored from EFF Archive.
314
315     Returns:
316         final_dict (dict): Dictionary representing CUSTOM_OBJS used in the EFF stored Keras model.
317     """
318     # Get class.
319     source_code = art.get_content()
320     spec = importlib.util.spec_from_loader('helper', loader=None)
321     helper = importlib.util.module_from_spec(spec)
322     exec(source_code, helper.__dict__) # noqa pylint: disable=W0122
323
324     final_dict = {}
325     # Get class name from attributes.
326     class_names = art["class_names"]
327     for cn in class_names:
328         final_dict[cn] = getattr(helper, cn)
329     return final_dict

and after the change, the code looks like this:

309 def deserialize_custom_layers(art):
310     """Deserialize the code for custom layer from EFF.
311
312     Args:
313         art (eff.core.artifact.Artifact): Artifact restored from EFF Archive.
314
315     Returns:
316         final_dict (dict): Dictionary representing CUSTOM_OBJS used in the EFF stored Keras model.
317     """
318     # Get class.
319     # source_code = art.get_content()
320     # spec = importlib.util.spec_from_loader('helper', loader=None)
321     source_code = art.get_content()
322     bak = source_code.split("\n")
323     bak[6] = "from tensorflow.keras.layers import InputSpec"
324     bak[682] = "class ZeroPadding1D_NCW(keras.layers.ZeroPadding1D):"
325     source_code = "\n".join(bak)
326     spec = importlib.util.spec_from_loader('helper', loader=None)
327
328     helper = importlib.util.module_from_spec(spec)
329     exec(source_code, helper.__dict__) # noqa pylint: disable=W0122
330
331     final_dict = {}
332     # Get class name from attributes.
333     class_names = art["class_names"]
334     for cn in class_names:
335         final_dict[cn] = getattr(helper, cn)
336     return final_dict

Still the same error.

@Morganh if it helps, I can also share the weights. Please let me know.

Yes, you can share the .tltb file with me.
Also, you can exit the docker container and run the docker run command and the steps above again to double-check.

@Morganh I have tried double-checking; it always gives the same error. Here is the tltb file url

I can reproduce ValueError: bad marshal data (unknown type code) with your tltb file.
Is the onnx file yours or a 3rd-party public one?
If it is not yours, please try to follow BYOM Converter - NVIDIA Docs to generate a new one.
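
For background (a guess about the mechanism, not a confirmed diagnosis of your file): Keras Lambda layers store their Python function as marshal'ed bytecode, and marshal data is only readable by the same Python version that produced it. Your traceback ends in keras/utils/generic_utils.py func_load calling marshal.loads, which is where such a mismatch surfaces as bad marshal data. A minimal sketch of the mechanism:

# Minimal illustration, not TAO code: marshal'ed code objects are not portable
# across Python versions, which is one common way to hit
# "ValueError: bad marshal data (unknown type code)".
import marshal

def double(x):
    return x * 2

raw = marshal.dumps(double.__code__)   # roughly what Keras func_dump() stores for a Lambda layer
code = marshal.loads(raw)              # works here; can fail with "bad marshal data" when raw
                                       # was produced by a different Python version
print(code.co_name)                    # -> double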

@Morganh yes the onnx is 3rd-party model. What can I do in this case?

You can try another onnx file and retry. See:
• BYOM Converter - NVIDIA Docs
• GitHub - NVIDIA-AI-IOT/tao_byom_examples: Examples of converting different open-source deep learning models to TAO compatible format through TAO BYOM package
• tao_byom_examples/classification at main · NVIDIA-AI-IOT/tao_byom_examples · GitHub
i.e.,
Follow GitHub - NVIDIA-AI-IOT/tao_byom_examples to "Install Python Dependencies", then
$ git clone https://gitlab-master.nvidia.com/tlt/tao-byom-example.git
$ python export_torchvision.py -m resnet18
$ tao_byom -m onnx_models/resnet18.onnx -r results/resnet18 -n resnet18 -k nvidia_tlt -p 188

The resnet18.tltb will be generated.
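
After that, point the model section of your spec at the new file; roughly like this, where the output path follows the -r option above (adjust it to your bind mounts) and the input size is whatever the model was exported with (224 for the torchvision resnet18 example):

model:
  backbone: 'byom'
  input_width: 224
  input_height: 224
  input_channels: 3
  input_image_depth: 8
  byom_model: '/workspace/tao-experiments/results/resnet18/resnet18.tltb'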

@Morganh my original model was based on CaffeNet, which is not in the list of tested models in tao_byom_examples. I have a GoogLeNet version of the same model; I think I will use the GoogLeNet weights and convert them to onnx and then to tltb.

So CaffeNet is not supported by tao-byom?