Unable to train image classification tf2

Please provide the following information when requesting support.

• Hardware (dGPU)
• Network Type (Classification_tf2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
"dockers:
  nvidia/tao/tao-toolkit:
    4.0.0-tf2.9.1:
      docker_registry: nvcr.io
      tasks:
        1. classification_tf2
        2. efficientdet_tf2
    4.0.0-tf1.15.5:
      docker_registry: nvcr.io
      tasks:
        1. augment
        2. bpnet
        3. classification_tf1
        4. detectnet_v2
        5. dssd
        6. emotionnet
        7. efficientdet_tf1
        8. faster_rcnn
        9. fpenet
        10. gazenet
        11. gesturenet
        12. heartratenet
        13. lprnet
        14. mask_rcnn
        15. multitask_classification
        16. retinanet
        17. ssd
        18. unet
        19. yolo_v3
        20. yolo_v4
        21. yolo_v4_tiny
        22. converter
    4.0.0-pyt:
      docker_registry: nvcr.io
      tasks:
        1. action_recognition
        2. deformable_detr
        3. segformer
        4. re_identification
        5. pointpillars
        6. pose_classification
        7. n_gram
        8. speech_to_text
        9. speech_to_text_citrinet
        10. speech_to_text_conformer
        11. spectro_gen
        12. vocoder
        13. text_classification
        14. question_answering
        15. token_classification
        16. intent_slot_classification
        17. punctuation_and_capitalization
format_version: 2.0
toolkit_version: 4.0.0
published_date: 12/08/2022"
• Training spec file (If have, please share here)
"results_dir: '/workspace/tao-experiments/classification_tf2/output'
key: 'key'
data:
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  preprocess_mode: 'torch'
  augment:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: True
  pretrained_model_path: '/home/getting_started_v4.0.0/notebooks/tao_launcher_starter_kit/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0/'
  batch_size_per_gpu: 2
  num_epochs: 80
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  model_path: 'EVALMODEL'
  output_path: 'PRUNEDMODEL'
  threshold: 0.68
  byom_model_path: ''"
• How to reproduce the issue? (!tao classification_tf2 train -e $SPECS_DIR/spec.yaml)

The error message is:

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-01-03 12:38:30,902 [INFO] root: Registry: ['nvcr.io']
2023-01-03 12:38:30,930 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
2023-01-03 12:38:33,879 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The message above is not the error.

To debug, please open a terminal and run inside the docker:
$ tao classification_tf2 run /bin/bash

Then you are already running inside the docker, so run:
# classification_tf2 train yourspec.yaml

Also, please pay attention to setting up a correct ~/.tao_mounts.json file, for example:
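For reference, a minimal ~/.tao_mounts.json only needs the mounts that map your local directories into the container. This is a sketch with placeholder paths (adjust <your_user> and the source directories to your own setup):

$ cat > ~/.tao_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/home/<your_user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}
EOF

Every path referenced in the spec file (datasets, pretrained model, results_dir) must sit under one of the mounted source directories, otherwise it is not visible inside the container.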

The prompt is "I have no name!@9a2d9b96d20a:/opt/nvidia$". Should I run the command from this path?

I have no name!@9a2d9b96d20a:/workspace/tao-experiments$ classification_tf2 train classification_tf2/tao_voc/specs/spec.yaml
Illegal instruction

This is the error I am getting.

It is due to an old CPU.
Please check with the help of the topic Core dump Illegal Instruction on detectnet_v2 example, or try another host PC.
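As a quick sanity check (a generic sketch, not a TAO command), you can list the relevant CPU flags on the host; the prebuilt TensorFlow inside the container expects instruction sets such as SSE4.1 and AVX2, so both should appear:

$ grep -o -E 'avx2|sse4_1' /proc/cpuinfo | sort -u

If either flag is missing, TensorFlow will crash with Illegal instruction or a core dump as soon as it is imported.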

When I try to run the notebook, the container stops without any error message…
What could be causing this?

This is the error message. It is due to an old CPU which is missing AVX2.

python3 -c 'import tensorflow as tf;

The TensorFlow library was compiled to use SSE4.1 instructions, but these aren't available on your machine.
Aborted (core dumped)

Please refer to Core dump Illegal Instruction on detectnet_v2 example - #16 by project2kq54

The AVX2 issue is fixed.

Now I am getting the following error:
Telemetry data couldn't be sent, but the command ran successfully.

keras_metadata.pb saved_model.pb variables

What model should I use for pretrained_classification_tf2?
I don't get an .hdf5 file when I download the model from NGC.

print("Check that model is downloaded into dir.")

!ls -l $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0

OUTPUT:
Check that model is downloaded into dir.
total 4980
-rw------- 1 abc abc 506069 Jan 2 16:45 keras_metadata.pb
-rw------- 1 abc abc 4584557 Jan 2 16:45 saved_model.pb
drwx------ 2 abc abc 4096 Jan 2 16:45 variables

How do I pass the right pretrained model to the pretrained_model_path parameter?

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-01-03 15:16:16,444 [INFO] root: Registry: ['nvcr.io']
2023-01-03 15:16:16,469 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
[1672739182.831962] [e1708cd770a4:319 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument
Setting up communication with ClearML server.
ClearML task init failed with error ClearML configuration could not be found (missing ~/clearml.conf or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own clearml-server, or create a free account at https://app.clear.ml
Training will still continue.
Starting classification training.
Found 8134 images belonging to 2 classes.
Processing dataset (train): /workspace/tao-experiments/data/split/train
Found 1692 images belonging to 2 classes.
Processing dataset (validation): /workspace/tao-experiments/data/split/val
Only .hdf5, .tlt, .tltb are supported.
Error executing job with overrides:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "", line 408, in main
  File "", line 76, in _func
  File "", line 49, in _func
  File "", line 319, in run_experiment
  File "", line 364, in load_model
AssertionError: Only .hdf5, .tlt, .tltb are supported.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in
  File "", line 412, in
  File "", line 87, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-03 15:16:26,615 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The NGC pretrained models are available at TAO Pretrained EfficientDet | NVIDIA NGC and TAO Pretrained Classification | NVIDIA NGC.
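For example (a sketch; <model_name> and <version> are placeholders, so confirm the exact names with "ngc registry model list" or on the NGC pages above), the pretrained weights can be pulled with the NGC CLI:

$ ngc registry model list nvidia/tao/pretrained_classification*
$ ngc registry model download-version nvidia/tao/<model_name>:<version> \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0

Whichever model you pick, train.pretrained_model_path should then point at an actual weights file with one of the supported extensions (.hdf5, .tlt, .tltb), not at a SavedModel directory, since passing a directory is what triggers the AssertionError above.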

