Error while training Deformable_detr using TAO

Please provide the following information when requesting support.

• RTX3050
• Deformable_DETR
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (in attachment)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
trainspec.txt (1.1 KB)

I am currently training an object detection model with deformable_detr from NGC, following the documentation provided. The dataset is in COCO format. Any clue how to solve the error below? The error log is attached.
errorlog.txt (6.4 KB)

Command line:

tao model deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.txt -r /home/jason/Desktop/deformable_detr/

The error comes from tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub.

Is the json file available inside the docker?
You can check with:
Method 1:
$ tao model deformable_detr ls -rlt /media/jason/New_Volume/cocotrainval/annotations/train_result.json

Method 2 (to debug inside the docker):
$ tao model deformable_detr run /bin/bash
This will log you into the docker container, where you can check whether the json file is available.

If it is not available, something is likely mismatched in the ~/.tao_mounts.json file, which maps local files into the docker.
In the tao command line, you need to use a path inside the docker rather than a local path.
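For reference, a minimal ~/.tao_mounts.json has the shape sketched below. The paths here are only illustrative placeholders, not your actual mapping: "source" is the local path and "destination" is the path seen inside the docker, and the spec file and command line must use the destination paths.

{
    "Mounts": [
        {
            "source": "/home/jason/Desktop/deformable_detr",
            "destination": "/workspace/deformable_detr"
        },
        {
            "source": "/media/jason/New_Volume/cocotrainval",
            "destination": "/workspace/cocotrainval"
        }
    ]
}

With a mapping like that, the annotation file would appear inside the docker on the destination side, e.g. /workspace/cocotrainval/annotations/train_result.json.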

After checking with the methods you provided, I am still getting the exact same error as before. Below are some clarifications for your reference.

I have updated my ~/.tao_mounts.json to use a shorter path to simplify our discussion. The train_result.json is at the following path inside the docker. I have attached the updated trainspec.txt for your reference.
trainspec.txt (977 Bytes)

/workspace/annotations/train_result.json

After running $ tao model deformable_detr run /bin/bash, the attachment below shows the location of train_result.json inside the docker.

Any advice to solve this problem?

How about /home/jason/Desktop/deformable_detr/trainspec.txt?
Can you check where it is inside the docker?

It is at the same path inside the docker as well.

So, please try again inside the docker, using the command below directly.

deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.txt -r /home/jason/Desktop/deformable_detr/

The same error still occurs after running the exact command you sent inside the docker.

Could you debug inside the docker in the following way?
Inside the docker, run:

# cp /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py.bak

Then copy the content from tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub and create a new file with vim:
$ vim /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py

Add a debug line as follows:

87        _files = data_source["json_file"]
          print("[Debug]: _files is {}".format(_files))    #debug line
88        extension = os.path.splitext(os.path.basename(_files))[1]

And run again. Thanks.

I have edited the data_source_config.py you pointed to with vim and added the debug line; unfortunately, the debug output does not appear in the error message.

The command line and error message are attached below:
errorlog.txt (6.1 KB)

For your reference, I followed the guide here to implement deformable_detr. The only things I amended were the dataset source paths in the train spec, and I downloaded the deformable_detr model from the NGC catalog.

Could you try some more experiments? For example, add a print statement after tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/data_source_config.py at main · NVIDIA/tao_pytorch_backend · GitHub to print the data_source.

I have debugged data_source_config.py and found that data_sources in the build_data_source_lists() function is empty.

82  if type(data_sources).__name__ == "DictConfig":
83      data_sources = [data_sources]
    print(data_sources)
85  for data_source in data_sources:

Any advice on how to debug this further would be much appreciated.

I think I found the root cause. Please change the spec file from .txt to a .yaml file.
A yaml file is expected.
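As a rough sketch of what the dataset section of the yaml spec should look like (the paths below are placeholders for your mounted locations inside the docker, and the exact key names should be verified against the sample train.yaml in the tao_tutorials repository):

dataset:
  train_data_sources:
    - image_dir: /workspace/images/train                   # placeholder; your mounted training image directory
      json_file: /workspace/annotations/train_result.json
  val_data_sources:
    - image_dir: /workspace/images/val                     # placeholder; your mounted validation image directory
      json_file: /workspace/annotations/val_result.json    # placeholder; your validation annotation file
  num_classes: 91                                          # set to your own number of classes
  batch_size: 1
  workers: 1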

I have changed the spec from .txt to a .yaml file, and I think that is a good start, as it resolved the previous bottleneck. Unfortunately, there is now a new error.
Attached below is the error log I got for your reference:
errorlog.txt (9.0 KB)

Is there any possibility that the version of the hydra package is causing this problem?

You need to set pretrained_backbone_path.
Refer to tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/specs/train.yaml at main · NVIDIA/tao_tutorials · GitHub
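In other words, the model section of the yaml spec should contain something along the lines of the sketch below; the path is only a placeholder for wherever you mounted the backbone weights downloaded from NGC, and the surrounding keys should be taken from the train.yaml linked above.

model:
  pretrained_backbone_path: /workspace/pretrained/resnet_50.hdf5   # placeholder path to the NGC backbone weights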

BTW, to get started, you can find more info in the notebook tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/deformable_detr.ipynb at main · NVIDIA/tao_tutorials · GitHub.

@Morganh A big thanks for the resources you sent above. I am able to train the deformable_detr model now, but only with 1 worker and a batch size of 1. It takes quite a long time, as each epoch takes 30-45 minutes to complete, and I got the error below while the model was training on the 16th epoch.

Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00,  4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00Train and Val metrics generated.                                                                                                                               
Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00,  4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00]Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL
2024-02-29 21:19:35,182 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Any idea what is causing the issue above?

I am afraid it is due to running out of memory.
More info can be found in Deformable detr model keeps failing to train - #4 by ianjasonmin and tao_tutorials/notebooks/tao_launcher_starter_kit/deformable_detr/deformable_detr.ipynb at main · NVIDIA/tao_tutorials · GitHub.

I see. Thanks for your help. Appreciate it.
