Error torch.distributed.elastic.multiprocessing.api:failed?

Please provide the following information when requesting support.

• Hardware: RTX 3090
• Network Type: Segformer FAN
• TLT Version:

tao info --verbose

Configuration of the TAO Toolkit Instance

task_group:
    model:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-tf2.11.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.2.0-pyt2.1.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                5.2.0.1-pyt1.14.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_pyt
                        2. segformer
    dataset:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-data-services:
                    docker_registry: nvcr.io
                    tasks:
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-deploy:
                    docker_registry: nvcr.io
                    tasks:
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0.1
published_date: 01/16/2024

• Training spec file (added a .txt extension to be able to upload; .yaml files are not allowed): fan_train512X512.yaml.txt (2.6 KB)

• How to reproduce the issue?

I run

!tao model segformer train \
                  -e $SPECS_DIR/fan_train512X512.yaml \
                  -r $RESULTS_DIR/ \
                  -g $NUM_GPUS

And this is the complete run log torch.distributed.elastic.multiprocessing.api:failed.log (13.8 KB)

No clue what to do.

Could you set a lower batch size and retry?
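
For reference, in the 5.x SegFormer spec the batch size normally sits under the dataset section. A minimal sketch, assuming the key names from the TAO documentation (please check them against your fan_train512X512.yaml):

dataset:
  batch_size: 2        # try lowering this value first
  workers_per_gpu: 1   # fewer dataloader workers also reduce host RAM usage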

@Morganh I set it to 1 and get the same error, now right at the beginning:

batch_size: 1

[>>>>>>>>>>>>>>>>>>>>>>>>>>] 3591/3591, 30.1 task/s, elapsed: 119s, ETA: 0sERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 471) of binary: /usr/bin/python

3591 is the number of validation images…

Could you share the output of $ nvidia-smi?
Also, can you open a terminal and run it there? Currently you are running from a notebook; I suggest running in a terminal instead to narrow this down.
$ tao model segformer run /bin/bash
Then inside the docker, run below.
#segformer train xxx
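
Also note that exit code -9 means the training process was killed with SIGKILL, which on Linux is usually the kernel OOM killer reclaiming host RAM rather than a CUDA out-of-memory error. A rough way to check on the host, assuming standard Ubuntu tools:

$ sudo dmesg -T | grep -iE "out of memory|killed process" | tail -n 20
$ watch -n 2 free -h    # watch host RAM and swap while the validation images are loading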

Running from the docker still produces an error

/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

david@AI01:~$ nvidia-smi
Thu Jul 18 05:52:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:65:00.0  On |                  N/A |
| 30%   57C    P2             316W / 350W |   5090MiB / 24576MiB |     90%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2326      G   /usr/lib/xorg/Xorg                          224MiB |
|    0   N/A  N/A      2543      G   /usr/bin/gnome-shell                        124MiB |
|    0   N/A  N/A      6379      G   ...19,262144 --variations-seed-version      146MiB |
|    0   N/A  N/A     47254      C   /usr/bin/python                            4578MiB |
+---------------------------------------------------------------------------------------+

Complete docker log… Docker tao segformer error 20240718.log (12.0 KB)

However, I found a log file with a message

TorchVision: 0.15.0a0
OpenCV: 4.8.0
MMCV: 1.7.1
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: not available

The complete log file is here: 20240718_031037.log.json.txt (2.5 KB)

May I know if you were able to run segformer training successfully on this machine before?

Yes, many times.

Did you notice this?

However, running nvcc -V inside the docker shows that CUDA is available:

root@9a6a74a823f4:/usr/local# nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Did you keep the log from a previous successful training run? Please compare it with your current log, particularly the MMCV CUDA Compiler: not available line.

@Morganh:

I think the problem is that when the dataset is relatively large, memory runs out.

At the beginning of training, the whole training dataset seems to be loaded. Then, at validation_interval, it runs validation on the 3,500 validation images, and I estimate memory runs out while those images are being loaded.

I ran 100,000 iterations and also set validation_interval to 100,000; validation failed, but training completed.
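
A sketch of the relevant spec keys for that workaround, assuming the standard TAO 5.x SegFormer layout (adjust to your actual fan_train512X512.yaml):

train:
  max_iters: 100000
  validation_interval: 100000   # equal to max_iters, so no validation pass runs during training
dataset:
  batch_size: 1
  workers_per_gpu: 1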

After exporting to TensorRT, validation took about an hour, but with very good results.

Thanks for your attention.

DB

