As the title says, I am trying to retrain an existing model (PeopleNet Transformer v2) to add a class to the four it already has (BG, Person, Face, Bag), using TAO 5.5.0.
Here’s the full setup:
• Hardware (T4/V100/Xavier/Nano/etc)
RTX 2080 on Ubuntu 20.04
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
PeopleNet Transformer v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
> tao info
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
• Training spec file(If have, please share here)
dataset:
  train_data_sources:
    - image_dir: /data/train/data
      json_file: /data/train/labels.json
  val_data_sources:
    - image_dir: /data/validation/data
      json_file: /data/validation/labels.json
  num_classes: 5
  batch_size: 2
  workers: 4
  augmentation:
    fixed_padding: False
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  pretrained_model_path: /model/dino_fan_small_astro_delta.pth
model:
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 100
  num_select: 50
  dropout_ratio: 0.0
  dim_feedforward: 2048
I have a dataset of 3000 images from Open Images: 1000 of them have Baseball bat (the new class) annotations and 2700 of them have Person annotations. All other Open Images annotations have been removed, and the classes are ordered the same as in the pretrained model (BG, Person, Face, Bag), with the new class (Baseball bat) at the end. The dataset is in COCO format (converted with FiftyOne).
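A quick sanity check on the label files can rule out a mismatch between the category ids in the COCO JSON and dataset.num_classes. The sketch below is only an illustration: the helper name summarize_coco_labels and the tiny inline sample are hypothetical, and it assumes (not confirmed for TAO here) that valid category ids must lie in the range 0 to num_classes - 1.

```python
from collections import Counter


def summarize_coco_labels(coco, num_classes):
    """Summarize category ids in a COCO-style dict.

    Returns (declared, used, out_of_range):
      declared     -- {category id: name} from the "categories" list
      used         -- {category id: annotation count} from "annotations"
      out_of_range -- sorted ids used by annotations outside [0, num_classes)
    The [0, num_classes) range is an assumption about what the trainer expects.
    """
    declared = {c["id"]: c["name"] for c in coco.get("categories", [])}
    used = Counter(a["category_id"] for a in coco.get("annotations", []))
    out_of_range = sorted(i for i in used if not 0 <= i < num_classes)
    return declared, dict(used), out_of_range


# Hypothetical sample: ids run from 1 to 5, so id 5 falls outside [0, 5).
sample = {
    "categories": [
        {"id": 1, "name": "BG"},
        {"id": 2, "name": "Person"},
        {"id": 3, "name": "Face"},
        {"id": 4, "name": "Bag"},
        {"id": 5, "name": "Baseball bat"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 1},
        {"image_id": 1, "category_id": 5},
        {"image_id": 2, "category_id": 5},
    ],
}

declared, used, out_of_range = summarize_coco_labels(sample, num_classes=5)
print("declared:", declared)
print("used:", used)
print("out of range:", out_of_range)
```

To run it on the real data, load /data/train/labels.json with json.load and pass the resulting dict in place of sample. Any id reported as out of range would be a candidate cause for an "index out of bounds" assertion on the GPU.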
The pretrained model is dino_fan_small_astro_delta.pth, as downloaded from PeopleNet Transformer v2.0 | NVIDIA NGC. The train.yaml file is based on the sample given on that model’s page, and since that page says “This model was trained using the DINO entrypoint in TAO”, I’m using that entry point in my TAO command:
tao model dino train -e /specs/train.yaml results_dir=/results/
When I start the training session, I see a lot of these errors (or similar):
Traceback (most recent call last):
File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
msg = self.format(record)
File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/usr/lib/python3.10/logging/__init__.py", line 678, in format
record.message = record.getMessage()
File "/usr/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 152, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 119, in run
ret = run_job(
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 70, in run_experiment
logging.info(f"skip layer: {k}, checkpoint layer size: {list(v.size())},",
Arguments: ('current model layer size: [5]',)
--- Logging error ---
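These "Logging error" traces look like a formatting problem in the log call itself rather than a training failure: the call shown at train.py line 70 passes a second positional string, which Python's logging module treats as a %-format argument for a message that has no placeholder. The standard library catches that error inside the handler and prints it, so the program continues. A minimal reproduction using only the standard library (the helper name render and the layer name k are placeholders, not TAO code):

```python
import logging


def render(msg, *args):
    """Format a message the way logging does; return the error text on failure."""
    record = logging.LogRecord(
        name="demo", level=logging.INFO, pathname="demo.py", lineno=1,
        msg=msg, args=args, exc_info=None,
    )
    try:
        # getMessage() evaluates `msg % args` when args is non-empty.
        return record.getMessage()
    except TypeError as exc:
        return f"TypeError: {exc}"


# Same shape as the call in the traceback: two strings separated by a comma,
# so the second one becomes a format argument with no matching placeholder.
broken = render("skip layer: k, checkpoint layer size: [4],",
                "current model layer size: [5]")

# A single string (or a message containing %s) formats without error.
fixed = render("skip layer: k, checkpoint layer size: [4], "
               "current model layer size: [5]")

print(broken)
print(fixed)
```

The useful information in those traces is the message itself: the checkpoint layer has size [4] and the current model layer has size [5], so that layer is being skipped when the pretrained weights are loaded.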
The training run ends with this final failure:
... thread: [62,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 242, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 191, in run
self._optimizer_step(batch_idx, closure)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 269, in _optimizer_step
call._call_lightning_module_hook(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1303, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 152, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 165, in step
loss = closure()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
closure_result = closure()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
self._result = self.closure(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
step_output = self._step_fn()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 195, in training_step
loss_dict = self.criterion(outputs, targets)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 174, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 28, in box_cxcywh_to_xyxy
return torch.stack(b, dim=-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I suspect the issue is that the model was pretrained with 4 classes, whereas my dataset has 5, so I need the model to output 5 values:
dataset:
  num_classes: 5
Is that the cause of the problem? How do I tell TAO about this change and get it to reconfigure the last layer(s) of the model to support the new class?
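For context, the "skip layer" message in the log above suggests that the loader compares each checkpoint layer's shape with the current model and skips the ones that differ, which would leave the resized layers at their fresh initialization. The sketch below illustrates that general idea with plain shape tuples instead of tensors. It is not TAO's actual code, and the function name split_by_shape and the layer names are hypothetical.

```python
def split_by_shape(checkpoint_shapes, model_shapes):
    """Split checkpoint layers into those that fit the model and those that do not.

    Both arguments map layer name -> shape tuple. A layer is loadable only if
    the model has a layer with the same name and exactly the same shape.
    """
    loadable, skipped = {}, []
    for name, shape in checkpoint_shapes.items():
        if model_shapes.get(name) == shape:
            loadable[name] = shape
        else:
            skipped.append(name)
    return loadable, skipped


# Hypothetical layers: the classification bias grows from 4 to 5 entries.
checkpoint_shapes = {"backbone.weight": (256, 3), "class_head.bias": (4,)}
model_shapes = {"backbone.weight": (256, 3), "class_head.bias": (5,)}

loadable, skipped = split_by_shape(checkpoint_shapes, model_shapes)
print("loadable:", sorted(loadable))
print("skipped:", skipped)
```

If the loader does behave this way, the 4-versus-5 size mismatch on its own would be handled at load time, and the later device-side assertion would need a separate explanation.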
I have read through DINO - NVIDIA Docs but cannot find anything related to that (other than the dataset/num_classes option).
I have looked at TAO Toolkit Use Cases - 4. Add new classes of objects to an existing AI model | NVIDIA Developer and at the sample project, but I believe it was written for an older version of TAO and no longer seems applicable. Is there an updated version of this tutorial somewhere?
Any help/hint much appreciated.