Cannot integrate WandB with Classification-TF2

Hello, I am trying to integrate WandB with Classification-TF2 by following the "TAO WandB Integration" tutorial in the NVIDIA Docs.

While it works for DetectNet-v2, it does not work for Classification-TF2.

Here are the steps to reproduce the issue.

  1. Log in to the wandb account with:
import os
import wandb

# Placeholder value; replace with the real API key.
os.environ["WANDB_API_KEY"] = "my api key"
WANDB_LOGGED_IN = wandb.login()
if WANDB_LOGGED_IN:
    print("WANDB successfully logged in.")
  2. Set the ~/.tao_mounts.json file as:
{
    "Mounts": [
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs",
            "destination": "/workspace/tao-experiments/classification_tf2/tao_voc/specs"
        }
    ],
    "DockerOptions": {},
    "Envs": [
        {
            "variable": "WANDB_API_KEY",
            "value": "my api key"
        }
    ]
}
  3. Set the training spec file spec.yaml as:
results_dir: '/workspace/tao-experiments/classification_tf2/output'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/training_set"
  val_dataset_path: "/workspace/tao-experiments/data/split/val_set"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 32
  num_epochs: 120
  optim_config:
    optimizer: 'adam'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  wandb:
    entity: "name_of_entity"
    name: "name_of_the_experiment"
    project: "name_of_the_project"
model:
  backbone: 'efficientnet-b0'
  input_width: 256
  input_height: 256
  input_channels: 3
  input_image_depth: 8
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test_set"
  checkpoint: "/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_098.tlt"
  top_k: 1
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_120.tlt'
  threshold: 0.68
  byom_model_path: ''
  4. Train the model in the sample Jupyter notebook with this command: !tao model classification_tf2 train -e $SPECS_DIR/spec.yaml

  5. The following error appears:

2024-12-12 09:24:28,366 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-12 09:24:28,437 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2024-12-12 09:24:28,462 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nvidia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-12 09:24:28,462 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-12-12 00:24:30.011626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 00:24:30.011686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 00:24:30.013371: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 00:24:30.020559: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /workspace/tao-experiments/classification_tf2/output/train
wandb: Currently logged in as: 99 (99-personal). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Initializing wandb.
wandb: Currently logged in as: 99. Use `wandb login --relogin` to force relogin
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1176, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 633, in init
    run = Run(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 566, in __init__
    self._init(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 676, in _init
    self._config._update(config, ignore_locked=True)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 177, in _update
    sanitized = self._sanitize_dict(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 264, in _sanitize_dict
    k, v = self._sanitize(k, v, allow_val_change)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 282, in _sanitize
    val = json_friendly_val(val)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 671, in json_friendly_val
    converted = asdict(val)
  File "/usr/lib/python3.10/dataclasses.py", line 1238, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1275, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
TypeError: first argument must be callable or None
Problem at: <frozen common.mlops.wandb> 119 initialize_wandb
Wandb logging failed with error An unexpected error occurred
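
For reference, the mounts file from step 2 can also be generated from Python, so the API key is read from the environment instead of being typed into the file, and the "user":"UID:GID" entry that the Docker warning in the log asks for is filled in. This is only a sketch: build_tao_mounts and write_tao_mounts are hypothetical helper names, not part of the TAO launcher, and the directory layout is assumed to match the one shown in step 2.

```python
import json
from pathlib import Path


def build_tao_mounts(source_root: str, api_key: str, uid: int, gid: int) -> dict:
    """Build the structure of ~/.tao_mounts.json shown in step 2 (hypothetical helper)."""
    return {
        "Mounts": [
            {"source": source_root, "destination": "/workspace/tao-experiments"},
            {
                "source": f"{source_root}/specs",
                "destination": "/workspace/tao-experiments/classification_tf2/tao_voc/specs",
            },
        ],
        # The launcher warning suggests "user": "UID:GID" to keep host permissions.
        "DockerOptions": {"user": f"{uid}:{gid}"},
        "Envs": [{"variable": "WANDB_API_KEY", "value": api_key}],
    }


def write_tao_mounts(config: dict, path: Path) -> None:
    """Write the config as indented JSON to the given path."""
    path.write_text(json.dumps(config, indent=4))
```

On Linux it could be called as write_tao_mounts(build_tao_mounts(root, os.environ["WANDB_API_KEY"], os.getuid(), os.getgid()), Path.home() / ".tao_mounts.json"), where root is the tao_voc directory from step 2.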

Thanks in advance for your support!


Please provide the following information when requesting support.

• Hardware (A40-16q)
• Network Type (Classification-TF2)
• TLT Version (
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
)
• Training spec file(as shown above)
• How to reproduce the issue ? (as shown above)

Could you add the line below to the wandb section of the spec and retry?
tags: "tao_toolkit"

With that change, the training itself fails with the following error:

2024-12-15 16:48:22,713 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-15 16:48:22,780 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2024-12-15 16:48:22,804 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nvidia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-15 16:48:22,804 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-12-15 07:48:24.320496: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-15 07:48:24.320545: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-15 07:48:24.322271: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-15 07:48:24.329460: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_loader_impl.py", line 457, in _load_single_config
    merged = OmegaConf.merge(schema.config, config)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py", line 273, in merge
    target.merge_with(*configs[1:])
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 492, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 819, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 490, in merge_with
    self._merge_with(*others)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 399, in _map_merge
    dest_node._merge_with(src_node)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 401, in _map_merge
    dest.__setitem__(key, src_node)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 314, in __setitem__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 308, in __setitem__
    self.__set_impl(key=key, value=value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 318, in __set_impl
    self._set_item_impl(key, value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 604, in _set_item_impl
    self.__dict__["_content"][key]._set_value(value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 618, in _set_value
    raise e
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 614, in _set_value
    self._set_value_impl(value, flags)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 646, in _set_value_impl
    raise ValidationError(msg)
omegaconf.errors.ValidationError: Invalid value assigned: AnyNode is not a ListConfig, list or tuple.
    full_key: train.wandb.tags
    reference_type=WandBConfig
    object_type=WandBConfig
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 32
  num_epochs: 120
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  wandb:
    entity: "name_of_entity"
    name: "name_of_the_experiment"
    project: "name_of_the_project"
    tags: "tao_toolkit"

Could you please retry with the list form below?
tags: ["classification", "training", "tao-toolkit"]
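
The ValidationError above says that train.wandb.tags must be a list or tuple, which is why the bare string was rejected and the bracketed form is accepted by the spec loader. As a rough illustration only, the same rule can be expressed as a small pre-check on the value before it goes into the spec; normalize_tags is a hypothetical helper and is not part of TAO or OmegaConf.

```python
from typing import Any, List


def normalize_tags(value: Any) -> List[str]:
    """Return tags as a list of strings, wrapping a lone string in a list."""
    if isinstance(value, str):
        # A bare string such as "tao_toolkit" is what the schema rejected.
        return [value]
    if isinstance(value, (list, tuple)):
        return [str(tag) for tag in value]
    raise TypeError(
        f"tags must be a string, list or tuple, got {type(value).__name__}"
    )
```

With this, normalize_tags("tao_toolkit") yields a one-element list, and a list such as the one suggested above passes through unchanged.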

It is still not working. The error looks the same as the first one.

2024-12-15 17:22:32,888 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-15 17:22:32,953 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2024-12-15 17:22:32,975 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nvidia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-15 17:22:32,975 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-12-15 08:22:34.468442: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-15 08:22:34.468493: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-15 08:22:34.470208: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-15 08:22:34.477118: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /workspace/tao-experiments/classification_tf2/output/train
wandb: Currently logged in as: 99 (99-personal). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Initializing wandb.
wandb: Currently logged in as: 99. Use `wandb login --relogin` to force relogin
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1176, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 633, in init
    run = Run(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 566, in __init__
    self._init(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 676, in _init
    self._config._update(config, ignore_locked=True)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 177, in _update
    sanitized = self._sanitize_dict(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 264, in _sanitize_dict
    k, v = self._sanitize(k, v, allow_val_change)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 282, in _sanitize
    val = json_friendly_val(val)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 671, in json_friendly_val
    converted = asdict(val)
  File "/usr/lib/python3.10/dataclasses.py", line 1238, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1275, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
TypeError: first argument must be callable or None
Problem at: <frozen common.mlops.wandb> 119 initialize_wandb
Wandb logging failed with error An unexpected error occurred
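
A note on this traceback: the last frame, return type(obj)((_asdict_inner(k, dict_factory), ..., is the branch of dataclasses.asdict in Python 3.10 that rebuilds a dict-like field by calling the field's own type on a generator of key/value pairs. "first argument must be callable or None" is the message collections.defaultdict raises when its first argument is not a factory. One plausible reading, which nothing in this thread confirms, is that the config object handed to wandb contains a defaultdict (or a subclass of it). The sketch below reproduces only that mechanism; rebuild_mapping and to_plain are hypothetical names used for illustration.

```python
from collections import defaultdict
from collections.abc import Mapping
from typing import Any


def rebuild_mapping(obj: Mapping) -> Mapping:
    """Rebuild a mapping the way the failing frame does: type(obj)(pairs)."""
    return type(obj)((k, v) for k, v in obj.items())


def to_plain(obj: Any) -> Any:
    """Recursively convert mappings to dict and lists/tuples to list."""
    if isinstance(obj, Mapping):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj


# A plain dict survives the rebuild.
print(rebuild_mapping({"a": 1}))

# A defaultdict does not: the generator is taken as default_factory.
try:
    rebuild_mapping(defaultdict(list, {"a": 1}))
except TypeError as err:
    print("TypeError:", err)

# Converting to plain containers first avoids the problem.
print(rebuild_mapping(to_plain(defaultdict(list, {"a": 1}))))
```

If this is indeed the cause, a conversion like to_plain would have to happen inside TAO's own wandb wrapper (the initialize_wandb frame in the log), which is not something a spec file change can reach.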

Could you please use nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0 and retry? This error does not occur in that older Docker image.