Error while training Deformable_detr using TAO

Please provide the following information when requesting support.

• RTX3050
• Deformable_DETR
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (in attachment)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
trainspec.txt (1.1 KB)

I am currently training an object detection model with deformable_detr from NGC, following the documentation provided. The dataset is in COCO format. Any clue how to solve the error below? The error log is attached.
errorlog.txt (6.4 KB)

Command line:

tao model deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.txt -r /home/jason/Desktop/deformable_detr/

The error comes from tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub.

Is the json file available inside the docker?
You can check with:
Method 1:
$ tao model deformable_detr ls -rlt /media/jason/New_Volume/cocotrainval/annotations/train_result.json

Method 2 (to debug inside the docker):
$ tao model deformable_detr run /bin/bash
This will log you into the docker container, where you can check whether the json file is available.

If it is not available, something is likely mismatched in the ~/.tao_mounts.json file, which maps local files into the docker.
In the tao command line, you need to use a path inside the docker rather than a local path.
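For reference, a minimal ~/.tao_mounts.json has the shape sketched below. The paths here are only illustrative placeholders, not your actual mapping: "source" is the local path and "destination" is the path seen inside the docker, and the spec file and command line must use the destination paths.

{
    "Mounts": [
        {
            "source": "/home/jason/Desktop/deformable_detr",
            "destination": "/workspace/deformable_detr"
        },
        {
            "source": "/media/jason/New_Volume/cocotrainval",
            "destination": "/workspace/cocotrainval"
        }
    ]
}

With a mapping like that, the annotation file would appear inside the docker on the destination side, e.g. /workspace/cocotrainval/annotations/train_result.json.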

After checking with the methods you provided, I am still getting the exact same error as before. Below are some clarifications for your reference.

I have updated my ~/.tao_mounts.json to use a shorter path to simplify our discussion. The train_result.json is at the following path inside the docker. I have attached the updated trainspec.txt for your reference.
trainspec.txt (977 Bytes)

/workspace/annotations/train_result.json

After running $ tao model deformable_detr run /bin/bash, the attachment below shows the location of train_result.json inside the docker.

Any advice to solve this problem?

How about /home/jason/Desktop/deformable_detr/trainspec.txt?
Can you check where it is inside the docker?

It is at the same path inside the docker as well.

So, please try again inside the docker, using the command below directly.

deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.txt -r /home/jason/Desktop/deformable_detr/

The same error still occurs after running the exact command you sent inside the docker.

Could you debug inside the docker in the following way?
Inside the docker, run:

# cp /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py.bak

Then copy the content from tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub and create a new file with vim:
$ vim /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py

Add a debug line as follows:

87        _files = data_source["json_file"]
          print("[Debug]: _files is {}".format(_files))    #debug line
88        extension = os.path.splitext(os.path.basename(_files))[1]

And run again. Thanks.

I have edited the data_source_config.py you pointed to with vim and added the debug line; unfortunately, the debug output does not appear in the error message.

The command line and error message are attached below:
errorlog.txt (6.1 KB)

For your reference, I followed the guide here to implement deformable_detr. The only things I amended were the dataset source paths in the train spec, and I downloaded the deformable_detr model from the NGC catalog.

Could you try some more experiments? For example, add a print statement after tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub to print the data_source.

I have debugged data_source_config.py and found that data_sources in the build_data_source_lists() function is empty.

82  if type(data_sources).__name__ == "DictConfig":
83      data_sources = [data_sources]
    print(data_sources)
85  for data_source in data_sources:

Any advice on how to debug this further would be much appreciated.

I think I found the root cause. Please change the spec file from .txt to a .yaml file.
A yaml file is expected.
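As a rough sketch of what the dataset section of the yaml spec should look like (the paths below are placeholders for your mounted locations inside the docker, and the exact key names should be verified against the sample train.yaml in the tao_tutorials repository):

dataset:
  train_data_sources:
    - image_dir: /workspace/images/train                   # placeholder; your mounted training image directory
      json_file: /workspace/annotations/train_result.json
  val_data_sources:
    - image_dir: /workspace/images/val                     # placeholder; your mounted validation image directory
      json_file: /workspace/annotations/val_result.json    # placeholder; your validation annotation file
  num_classes: 91                                          # set to your own number of classes
  batch_size: 1
  workers: 1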

I have changed the spec from .txt to a .yaml file, and I think that is a good start, as it resolved the previous bottleneck. Unfortunately, there is now a new error.
Attached below is the error log I got for your reference:
errorlog.txt (9.0 KB)

Is there any possibility that the version of the hydra package is causing this problem?

You need to set pretrained_backbone_path.
Refer to tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/specs/train.yaml at main · NVIDIA/tao_tutorials · GitHub
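In other words, the model section of the yaml spec should contain something along the lines of the sketch below; the path is only a placeholder for wherever you mounted the backbone weights downloaded from NGC, and the surrounding keys should be taken from the train.yaml linked above.

model:
  pretrained_backbone_path: /workspace/pretrained/resnet_50.hdf5   # placeholder path to the NGC backbone weights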

BTW, to get started, you can find more info in the notebook tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/deformable_detr.ipynb at main · NVIDIA/tao_tutorials · GitHub.

@Morganh A big thanks for the resources you sent above. I am able to train the deformable_detr model now, but only with 1 worker and a batch size of 1. It takes quite a long time, as each epoch takes 30-45 minutes to complete, and I got the error below while the model was training on the 16th epoch.

Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00,  4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00Train and Val metrics generated.                                                                                                                               
Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00,  4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00]Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL
2024-02-29 21:19:35,182 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Any idea what is causing the issue above?

I am afraid it is due to running out of memory.
More info can be found in Deformable detr model keeps failing to train - #4 by ianjasonmin and tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/deformable_detr.ipynb at main · NVIDIA/tao_tutorials · GitHub.

I see. Thanks for your help. Appreciate it.
