Please provide the following information when requesting support.
• Hardware: RTX 3050
• Network Type: Deformable_DETR
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (in attachment)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.) trainspec.txt (1.1 KB)
I am currently training an object detection model using deformable_detr from NGC, following the documentation provided. The dataset is in COCO format. Any clue how to solve the error below? The error log is attached: errorlog.txt (6.4 KB)
Command line:
tao model deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.txt -r /home/jason/Desktop/deformable_detr/
Is the JSON file available inside the docker?
You can check with either of the following.
Method 1:
$ tao model deformable_detr ls -rlt /media/jason/New_Volume/cocotrainval/annotations/train_result.json
Method 2 (to debug inside the docker):
$ tao model deformable_detr run /bin/bash
This will log you into the docker container, where you can check whether the JSON file is available.
If it is not available, there is likely a mismatch in the ~/.tao_mounts.json file. This file maps local paths into the docker.
In the tao command line, you need to use a path inside the docker instead of a local path.
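For illustration only, a minimal ~/.tao_mounts.json could look roughly like the one below. The source path is taken from the command above; the destination is just an example, not a required path.

$ cat ~/.tao_mounts.json
{
    "Mounts": [
        {
            "source": "/media/jason/New_Volume/cocotrainval",
            "destination": "/workspace/tao-experiments/data"
        }
    ]
}

With a mapping like this, the annotation file would be visible inside the docker at /workspace/tao-experiments/data/annotations/train_result.json, and that is the path the training spec should refer to. You can add a similar entry for the folder holding your spec file and results.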
After checking with the methods you provided, I am still getting the exact same error as before. Below are some clarifications for your reference.
I have updated my ~/.tao_mounts.json to use a shorter path to simplify our discussion. The train_result.json is at the following path. I have attached the updated train_spec.txt for your reference: trainspec.txt (977 Bytes)
/workspace/annotations/train_result.json
After running $ tao model deformable_detr run /bin/bash, the attachment below shows the location of the train_result.json.
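The check I did inside the container was essentially the following (a rough sketch, using the path from my updated mounts file):

$ tao model deformable_detr run /bin/bash
# inside the container
$ ls -l /workspace/annotations/train_result.json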
I have edited (with vim) the data_source_config.py that you provided and added the debug line; unfortunately, the debug line does not show up in the error message.
The command line and the error message are attached below: errorlog.txt (6.1 KB)
For your reference, I referred to here to implement the deformable_detr. The only thing I have amended is the dataset source path in the train_spec, and I downloaded the deformable_detr model from the NGC catalog.
I have changed the spec from a .txt to a .yaml file, and I think that is a good start toward solving the previous bottleneck. Unfortunately, there is another new error.
Attached below is the error log I got, for your reference: errorlog.txt (9.0 KB)
Is there any possibility that the version of the hydra package is causing this problem?
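For reference, what I changed was essentially renaming the spec file and re-running the same train command against the .yaml (a rough sketch using the paths from my earlier command line; adjust if yours differ):

$ mv /home/jason/Desktop/deformable_detr/trainspec.txt /home/jason/Desktop/deformable_detr/trainspec.yaml
$ tao model deformable_detr train -e /home/jason/Desktop/deformable_detr/trainspec.yaml -r /home/jason/Desktop/deformable_detr/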
@Morganh A big thanks for the resources you sent above. I am able to train the deformable_detr model now, but only with 1 worker and a batch size of 1. It takes quite a long time, as each epoch takes 30-45 minutes to complete, and I got the error below after the model reached the 16th epoch.
Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00, 4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00Train and Val metrics generated.
Epoch 16: 100%|█████████████████████████████████████████████████████| 4356/4356 [17:09<00:00, 4.23it/s, loss=18.6, v_num=0, val_loss=21.50, train_loss=20.00]Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL
2024-02-29 21:19:35,182 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
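As a side note, to see whether any checkpoints were written before the failure, I look into the results directory passed with -r (just a sketch; the exact layout and file names of the output directory may differ between TAO versions):

# Results directory from the -r argument; checkpoints and logs, if any were written, should appear here.
$ ls -lt /home/jason/Desktop/deformable_detr/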