Add class to pretrained model using TAO 5.5.0

As the title says, I am trying to retrain an existing model (PeopleNet Transformer v2) to add a class to the four it already has (BG, Person, Face, Bag), using TAO 5.5.0.

Here’s the full setup:

• Hardware (T4/V100/Xavier/Nano/etc)
RTX 2080 on Ubuntu 20.04

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
PeopleNet Transformer v2

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

> tao info
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024

• Training spec file(If have, please share here)

dataset:
  train_data_sources:
    - image_dir: /data/train/data
      json_file: /data/train/labels.json
  val_data_sources:
    - image_dir: /data/validation/data
      json_file: /data/validation/labels.json
  num_classes: 5
  batch_size: 2
  workers: 4
  augmentation:
    fixed_padding: False
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  pretrained_model_path: /model/dino_fan_small_astro_delta.pth
model:
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 100
  num_select: 50
  dropout_ratio: 0.0
  dim_feedforward: 2048

I have a dataset of 3000 images from Open Images; 1000 of them have Baseball bat (the new class) annotations and 2700 have Person annotations. All other Open Images annotations have been removed, and the classes have been ordered as in the pretrained model (BG, Person, Face, Bag) with the new one at the end (Baseball bat). The dataset is in COCO format (converted with FiftyOne).
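For reference, the conversion was along these lines; a minimal sketch assuming the Open Images v7 zoo dataset and its default “detections” label field (class names, paths and sample counts are illustrative, not my exact script):

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Assumed Open Images label names for the classes I care about
wanted = ["Person", "Human face", "Handbag", "Baseball bat"]

# Pull a subset of Open Images that contains the wanted classes
dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
    classes=wanted,
    max_samples=3000,
)

# Drop every annotation whose label is not one of the wanted classes
view = dataset.filter_labels("detections", F("label").is_in(wanted))

# Export the images plus a COCO-format labels.json
view.export(
    export_dir="/data/train",
    dataset_type=fo.types.COCODetectionDataset,
    label_field="detections",
)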

The pretrained model is dino_fan_small_astro_delta.pth, as downloaded from PeopleNet Transformer v2.0 | NVIDIA NGC. The train.yaml file is based on the sample given on that model’s page, and since that page says “This model was trained using the DINO entrypoint in TAO”, I’m using that entry point in my TAO command:

tao model dino train -e /specs/train.yaml results_dir=/results/

When I start the training session, I see a lot of these errors (or similar):

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 152, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 119, in run
    ret = run_job(
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 70, in run_experiment
    logging.info(f"skip layer: {k}, checkpoint layer size: {list(v.size())},",
Message: 'skip layer: …, checkpoint layer size: [4],'
Arguments: ('current model layer size: [5]',)

The training run ends with this final failure:

…, thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 242, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 191, in run
    self._optimizer_step(batch_idx, closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 269, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1303, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 152, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 165, in step
    loss = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 195, in training_step
    loss_dict = self.criterion(outputs, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 174, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 28, in box_cxcywh_to_xyxy
    return torch.stack(b, dim=-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I suspect the issue is that the model was pretrained with 4 classes, but my dataset has 5, so the model's head needs to output 5 values:

dataset:
  num_classes: 5

Is that the cause of the problem? How do I tell TAO about this change and get it to reconfigure the last layer(s) of the model to support the new class?
I have read through DINO - NVIDIA Docs but cannot find anything related to that (other than the dataset/num_classes option).
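From the “skip layer” messages above, TAO apparently drops checkpoint tensors whose shape no longer matches the configured model and leaves those layers freshly initialized. A simplified sketch of that pattern (my reading of the log, not the actual TAO source):

import torch

def load_matching_weights(model, checkpoint_path):
    """Copy only checkpoint tensors whose shape still matches the model."""
    state = torch.load(checkpoint_path, map_location="cpu")
    state = state.get("state_dict", state)
    model_state = model.state_dict()

    filtered = {}
    for k, v in state.items():
        # e.g. a 4-class head in the checkpoint vs. a 5-class model
        if k in model_state and model_state[k].size() != v.size():
            print(f"skip layer: {k}, checkpoint layer size: {list(v.size())}, "
                  f"current model layer size: {list(model_state[k].size())}")
            continue
        filtered[k] = v

    # strict=False tolerates the skipped (and any missing) keys
    model.load_state_dict(filtered, strict=False)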

I have also looked at TAO Toolkit Use Cases - 4. Add new classes of objects to an existing AI model | NVIDIA Developer and at its sample project, but I believe they were written for an older version of TAO and no longer seem applicable. Is there an updated version of this tutorial somewhere?

Any help/hint much appreciated.

I think I have found my mistake: in the dataset conversion from the Open Images format to COCO, I had incorrectly set the class identifiers:

classes = {
    "BG": 1,
    "Person": 2,
    "Face": 3,
    "Bag": 4,
    "Baseball bat": 5
}

because label.txt for PeopleNet Transformer v2 in the DeepStream TAO apps project contains

BG
Person
Face
Bag

The correct setup is

classes = {
    "Person": 1,
    "Face": 2,
    "Bag": 3,
    "Baseball bat": 4
}

because BG is the implicit ‘background’ class with id 0.
The dataset.num_classes field in my train.yaml is therefore correct: it equals max(class id) + 1.
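A quick way to sanity-check (or remap) the ids in an existing labels.json; a sketch assuming the standard COCO layout, with illustrative paths and category names:

import json

# Ids the pretrained model expects (id 0 is the implicit BG class)
wanted_ids = {"Person": 1, "Face": 2, "Bag": 3, "Baseball bat": 4}

with open("/data/train/labels.json") as f:
    coco = json.load(f)

# Map each existing category id to the expected id, matching by name
old_to_new = {c["id"]: wanted_ids[c["name"]] for c in coco["categories"]}

for c in coco["categories"]:
    c["id"] = wanted_ids[c["name"]]
for ann in coco["annotations"]:
    ann["category_id"] = old_to_new[ann["category_id"]]

# dataset.num_classes must equal max(class id) + 1, i.e. 5 here
assert max(c["id"] for c in coco["categories"]) + 1 == 5

with open("/data/train/labels.json", "w") as f:
    json.dump(coco, f)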

I still get these errors when starting the session

Call stack:
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 152, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 119, in run
    ret = run_job(
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 70, in run_experiment
    logging.info(f"skip layer: {k}, checkpoint layer size: {list(v.size())},",
Message: 'skip layer: …, checkpoint layer size: [4],'
Arguments: ('current model layer size: [5]',)

but then the training runs.
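The remaining “Logging error” noise looks harmless: train.py passes a second positional argument to logging.info() alongside an f-string, so the logging module tries %-formatting the message and fails at emit time. A minimal reproduction (the layer name and sizes are made up):

import logging

logging.basicConfig(level=logging.INFO)

k, ckpt_size, model_size = "some.layer.weight", [4], [5]  # illustrative

# The second positional argument becomes a %-formatting arg, but the
# f-string contains no % placeholders, so the handler raises
# "TypeError: not all arguments converted during string formatting"
# and prints "--- Logging error ---" instead of the message.
logging.info(f"skip layer: {k}, checkpoint layer size: {ckpt_size},",
             f"current model layer size: {model_size}")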
