Unable to train image classification tf2

Please provide the following information when requesting support.

• Hardware (dGPU)
• Network Type (Classification_tf2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
"dockers:
  nvidia/tao/tao-toolkit:
    4.0.0-tf2.9.1:
      docker_registry: nvcr.io
      tasks:
        1. classification_tf2
        2. efficientdet_tf2
    4.0.0-tf1.15.5:
      docker_registry: nvcr.io
      tasks:
        1. augment
        2. bpnet
        3. classification_tf1
        4. detectnet_v2
        5. dssd
        6. emotionnet
        7. efficientdet_tf1
        8. faster_rcnn
        9. fpenet
        10. gazenet
        11. gesturenet
        12. heartratenet
        13. lprnet
        14. mask_rcnn
        15. multitask_classification
        16. retinanet
        17. ssd
        18. unet
        19. yolo_v3
        20. yolo_v4
        21. yolo_v4_tiny
        22. converter
    4.0.0-pyt:
      docker_registry: nvcr.io
      tasks:
        1. action_recognition
        2. deformable_detr
        3. segformer
        4. re_identification
        5. pointpillars
        6. pose_classification
        7. n_gram
        8. speech_to_text
        9. speech_to_text_citrinet
        10. speech_to_text_conformer
        11. spectro_gen
        12. vocoder
        13. text_classification
        14. question_answering
        15. token_classification
        16. intent_slot_classification
        17. punctuation_and_capitalization
format_version: 2.0
toolkit_version: 4.0.0
published_date: 12/08/2022"
• Training spec file (If have, please share here)
"results_dir: '/workspace/tao-experiments/classification_tf2/output'
key: 'key'
data:
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  preprocess_mode: 'torch'
  augment:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: True
  pretrained_model_path: '/home/getting_started_v4.0.0/notebooks/tao_launcher_starter_kit/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0/'
  batch_size_per_gpu: 2
  num_epochs: 80
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  model_path: 'EVALMODEL'
  output_path: 'PRUNEDMODEL'
  threshold: 0.68
  byom_model_path: ''"
• How to reproduce the issue? (!tao classification_tf2 train -e $SPECS_DIR/spec.yaml)

The error message is:

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-01-03 12:38:30,902 [INFO] root: Registry: ['nvcr.io']
2023-01-03 12:38:30,930 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
2023-01-03 12:38:33,879 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The message above is not the error.

To debug, please open a terminal and run inside the docker:
$ tao classification_tf2 run /bin/bash

Then you are already running inside the docker, so run:
# classification_tf2 train yourspec.yaml

Also, please pay attention to setting up a correct ~/.tao_mounts.json file, for example:
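For reference, a minimal ~/.tao_mounts.json only needs the mounts that map your local directories into the container. This is a sketch with placeholder paths (adjust <your_user> and the source directories to your own setup):

$ cat > ~/.tao_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/home/<your_user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}
EOF

Every path referenced in the spec file (datasets, pretrained model, results_dir) must sit under one of the mounted source directories, otherwise it is not visible inside the container.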

The prompt is "I have no name!@9a2d9b96d20a:/opt/nvidia$". Should I run the command from this path?

I have no name!@9a2d9b96d20a:/workspace/tao-experiments$ classification_tf2 train classification_tf2/tao_voc/specs/spec.yaml
Illegal instruction

This is the error I am getting.

It is due to an old CPU.
Please check with the help of the topic Core dump Illegal Instruction on detectnet_v2 example, or try another host PC.
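As a quick sanity check (a generic sketch, not a TAO command), you can list the relevant CPU flags on the host; the prebuilt TensorFlow inside the container expects instruction sets such as SSE4.1 and AVX2, so both should appear:

$ grep -o -E 'avx2|sse4_1' /proc/cpuinfo | sort -u

If either flag is missing, TensorFlow will crash with Illegal instruction or a core dump as soon as it is imported.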

When I try to run the notebook, the container stops without any error message…
What could be causing this?

This is the error message. It is due to an old CPU which is missing AVX2.

python3 -c 'import tensorflow as tf;

The TensorFlow library was compiled to use SSE4.1 instructions, but these aren't available on your machine.
Aborted (core dumped)

Please refer to Core dump Illegal Instruction on detectnet_v2 example - #16 by project2kq54

The AVX2 issue is fixed.

Now I am getting the following error:
Telemetry data couldn't be sent, but the command ran successfully.

keras_metadata.pb saved_model.pb variables

What model should I use for pretrained_classification_tf2?
I don't get an .hdf5 file when I download the model from NGC.

print("Check that model is downloaded into dir.")

!ls -l $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0

OUTPUT:
Check that model is downloaded into dir.
total 4980
-rw------- 1 abc abc 506069 Jan 2 16:45 keras_metadata.pb
-rw------- 1 abc abc 4584557 Jan 2 16:45 saved_model.pb
drwx------ 2 abc abc 4096 Jan 2 16:45 variables

How do I pass the right pretrained model to the pretrained_model_path parameter?

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2023-01-03 15:16:16,444 [INFO] root: Registry: ['nvcr.io']
2023-01-03 15:16:16,469 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
[1672739182.831962] [e1708cd770a4:319 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument
Setting up communication with ClearML server.
ClearML task init failed with error ClearML configuration could not be found (missing ~/clearml.conf or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own clearml-server, or create a free account at https://app.clear.ml
Training will still continue.
Starting classification training.
Found 8134 images belonging to 2 classes.
Processing dataset (train): /workspace/tao-experiments/data/split/train
Found 1692 images belonging to 2 classes.
Processing dataset (validation): /workspace/tao-experiments/data/split/val
Only .hdf5, .tlt, .tltb are supported.
Error executing job with overrides:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "", line 408, in main
  File "", line 76, in _func
  File "", line 49, in _func
  File "", line 319, in run_experiment
  File "", line 364, in load_model
AssertionError: Only .hdf5, .tlt, .tltb are supported.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in
  File "", line 412, in
  File "", line 87, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-03 15:16:26,615 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The NGC pretrained models are available at TAO Pretrained EfficientDet | NVIDIA NGC and TAO Pretrained Classification | NVIDIA NGC.
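For example (a sketch; <model_name> and <version> are placeholders, so confirm the exact names with "ngc registry model list" or on the NGC pages above), the pretrained weights can be pulled with the NGC CLI:

$ ngc registry model list nvidia/tao/pretrained_classification*
$ ngc registry model download-version nvidia/tao/<model_name>:<version> \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0

Whichever model you pick, train.pretrained_model_path should then point at an actual weights file with one of the supported extensions (.hdf5, .tlt, .tltb), not at a SavedModel directory, since passing a directory is what triggers the AssertionError above.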

