I am trying to train an RT-DETR model with a ResNet-50 backbone and I am getting the error below. Changing the backbone did not solve it. All of the input data is at resolution height 832, width 4096, and the number of classes is 3. A rough check of where the two tensor sizes in the error might come from is included after the log.
The config file is attached as well.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.12/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name           | Type              | Params | Mode
----------------------------------------------------------
0 | model          | RTDETRModel       | 42.7 M | train
1 | matcher        | HungarianMatcher  | 0      | train
2 | criterion      | SetCriterion      | 0      | train
3 | box_processors | RTDETRPostProcess | 0      | train
----------------------------------------------------------
42.7 M Trainable params
9.4 K Non-trainable params
42.7 M Total params
170.856 Total estimated model params size (MB)
596 Modules in train mode
1 Modules in eval mode
Serializing 4463 elements to byte tensors and concatenating them all …
Serialized dataset takes 1.25 MiB
Worker 0 obtains a dataset of length=4463 from its local leader.
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Error executing job with overrides:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py", line 142, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py", line 128, in run_experiment
trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1024, in _run_stage
self._run_sanity_check()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1053, in _run_sanity_check
val_loop.run()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 144, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 433, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
output = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/pl_rtdetr_model.py", line 287, in validation_step
outputs = self.model(data, targets)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/build_nn_model.py", line 157, in forward
x = self.model(x, targets)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/rtdetr.py", line 88, in forward
x, proj_feats = self.encoder(feats)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 339, in forward
memory = self.encoder[i](src_flatten, pos_embed=pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 203, in forward
output = layer(output, src_mask=src_mask, pos_embed=pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 172, in forward
q = k = self.with_pos_embed(src, pos_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/model/hybrid_encoder.py", line 165, in with_pos_embed
return tensor if pos_embed is None else tensor + pos_embed
~~~~~~~^~~~~~~~~~~
RuntimeError: The size of tensor a (16384) must match the size of tensor b (3328) at non-singleton dimension 1
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E1215 20:56:05.150000 366 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 393) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.6.0a0+ecf3bae40a.nv25.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/rtdetr/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]: time : 2025-12-15_20:56:05
host : keirton-dev-5090-MS-7E47
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 393)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - Telemetry data couldn’t be sent, but the command ran successfully. (entrypoint.py:350)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - Telemetry data couldn’t be sent, but the command ran successfully.
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - [Error]: ‘str’ object has no attribute ‘decode’ (entrypoint.py:353)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - [Error]: ‘str’ object has no attribute ‘decode’
2025-12-15 20:56:05,322 - [TAO Toolkit] - WARNING - Execution status: FAIL (entrypoint.py:357)
2025-12-15 20:56:05,322 - TAO Toolkit - WARNING - Execution status: FAIL
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name=27 encoding='UTF-8'>
/usr/lib/python3.12/tempfile.py:1075: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpkrfp7qol'>
_warnings.warn(warn_message, ResourceWarning)
2025-12-15 12:56:06,446 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 371: Stopping container.
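For what it is worth, the two sizes in the RuntimeError seem to line up with the input resolution if the self-attention in the hybrid encoder runs on the stride-32 feature map (my assumption, I have not checked the TAO source). This is the quick back-of-envelope check I did:

# Quick sanity check of the token counts in the error message.
# Assumption (not verified against TAO code): the hybrid encoder's
# self-attention runs on the stride-32 feature map, so the number of
# flattened tokens is (H // 32) * (W // 32).

def tokens_at_stride(height: int, width: int, stride: int = 32) -> int:
    """Flattened token count for a feature map at the given stride."""
    return (height // stride) * (width // stride)

print(tokens_at_stride(832, 4096))   # 26 * 128  = 3328  -> matches tensor b
print(tokens_at_stride(4096, 4096))  # 128 * 128 = 16384 -> matches tensor a

So one side appears to be sized for the configured 832 x 4096 resolution (3328 tokens) while the other corresponds to a square 4096 x 4096 input (16384 tokens). My guess is a mismatch between the resize/augmentation resolution and the spatial size the positional embedding is built for, but I am not sure which setting in the spec controls this.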
train.yaml.txt (1010 Bytes)