I am trying to train an RT-DETR model with a ResNet-50 backbone and I am getting the error below. Changing the backbone did not solve it. All of the input data is at resolution height 832, width 4096, and the number of classes is 3. A rough check of where the two tensor sizes in the error might come from is included after the log.
The config file is attached as well.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.12/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name           | Type              | Params | Mode
----------------------------------------------------------
0 | model          | RTDETRModel       | 42.7 M | train
1 | matcher        | HungarianMatcher  | 0      | train
2 | criterion      | SetCriterion      | 0      | train
3 | box_processors | RTDETRPostProcess | 0      | train
----------------------------------------------------------
42.7 M Trainable params
9.4 K Non-trainable params
42.7 M Total params
170.856 Total estimated model params size (MB)
596 Modules in train mode
1 Modules in eval mode
Serializing 4463 elements to byte tensors and concatenating them all …
Serialized dataset takes 1.25 MiB
Worker 0 obtains a dataset of length=4463 from its local leader.
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Error executing job with overrides:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py", line 142, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py", line 128, in run_experiment
trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1024, in _run_stage
self._run_sanity_check()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1053, in _run_sanity_check
val_loop.run()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 144, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 433, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
output = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/pl_rtdetr_model.py", line 287, in validation_step
outputs = self.model(data, targets)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/build_nn_model.py", line 157, in forward
x = self.model(x, targets)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/rtdetr.py", line 88, in forward
x, proj_feats = self.encoder(feats)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 339, in forward
memory = self.encoder[i](src_flatten, pos_embed=pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 203, in forward
output = layer(output, src_mask=src_mask, pos_embed=pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 172, in forward
q = k = self.with_pos_embed(src, pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 165, in with_pos_embed
return tensor if pos_embed is None else tensor + pos_embed
~~~~~~~^~~~~~~~~~~
RuntimeError: The size of tensor a (16384) must match the size of tensor b (3328) at non-singleton dimension 1
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E1215 20:56:05.150000 366 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 393) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.6.0a0+ecf3bae40a.nv25.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]: time : 2025-12-15_20:56:05
host : keirton-dev-5090-MS-7E47
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 393)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - Telemetry data couldn’t be sent, but the command ran successfully. (entrypoint.py:350)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - Telemetry data couldn’t be sent, but the command ran successfully.
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - [Error]: ‘str’ object has no attribute ‘decode’ (entrypoint.py:353)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - [Error]: ‘str’ object has no attribute ‘decode’
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - Execution status: FAIL (entrypoint.py:357)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - Execution status: FAIL
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name=27 encoding='UTF-8'>
/usr/lib/python3.12/tempfile.py:1075: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpkrfp7qol'>
_warnings.warn(warn_message, ResourceWarning)
2025-12-15 12:56:06,446 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 371: Stopping container.
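For what it is worth, the two sizes in the RuntimeError seem to line up with the input resolution if the self-attention in the hybrid encoder runs on the stride-32 feature map (my assumption, I have not checked the TAO source). This is the quick back-of-envelope check I did:

# Quick sanity check of the token counts in the error message.
# Assumption (not verified against TAO code): the hybrid encoder's
# self-attention runs on the stride-32 feature map, so the number of
# flattened tokens is (H // 32) * (W // 32).

def tokens_at_stride(height: int, width: int, stride: int = 32) -> int:
    """Flattened token count for a feature map at the given stride."""
    return (height // stride) * (width // stride)

print(tokens_at_stride(832, 4096))   # 26 * 128  = 3328  -> matches tensor b
print(tokens_at_stride(4096, 4096))  # 128 * 128 = 16384 -> matches tensor a

So one side appears to be sized for the configured 832 x 4096 resolution (3328 tokens) while the other corresponds to a square 4096 x 4096 input (16384 tokens). My guess is a mismatch between the resize/augmentation resolution and the spatial size the positional embedding is built for, but I am not sure which setting in the spec controls this.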
train.yaml.txt (1010 Bytes)