Could you share the output of $ nvidia-smi ?
Also, can you open a terminal and run the training there? Currently you are running from a notebook. I suggest running in a terminal instead to narrow this down.
$ tao model segformer run /bin/bash
Then, inside the docker, run the training command below.
# segformer train xxx
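For example, something like the following (the paths are only placeholders; they have to be in-container paths that match the mounts in your ~/.tao_mounts.json, and the train flags should follow your own spec):

# On the host: open an interactive shell inside the TAO SegFormer container
$ tao model segformer run /bin/bash

# Inside the container: start training directly so the full log stays in the terminal
segformer train -e /workspace/tao-experiments/specs/train_segformer.yaml 2>&1 | tee /workspace/tao-experiments/results/train.log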
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
david@AI01:~$ nvidia-smi
Thu Jul 18 05:52:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:65:00.0  On |                  N/A |
| 30%   57C    P2             316W / 350W |   5090MiB / 24576MiB |     90%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2326      G   /usr/lib/xorg/Xorg                          224MiB |
|    0   N/A  N/A      2543      G   /usr/bin/gnome-shell                        124MiB |
|    0   N/A  N/A      6379      G   ...19,262144 --variations-seed-version      146MiB |
|    0   N/A  N/A     47254      C   /usr/bin/python                            4578MiB |
+---------------------------------------------------------------------------------------+
I think the problem is that when the dataset is relatively large, memory runs out.
At the beginning of training, the whole training dataset seems to be loaded. Then, at validation_interval, it runs validation on 3,500 validation images, and my estimate is that memory runs out while those images are being loaded.
I ran 100,000 iterations and set validation_interval to 100,000 as well; the validation step failed, but training completed.
After exporting to TensorRT, validation took about an hour but gave very good results.
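One way to confirm whether it is host RAM or GPU memory that actually runs out when the 3,500 validation images are loaded would be to log both from a second terminal while training runs, e.g. something like this (the 5-second interval and log file name are arbitrary):

# Append host RAM and GPU memory usage to a log file every 5 seconds
$ while true; do echo "=== $(date) ===" >> mem_log.txt; free -m | grep -i mem >> mem_log.txt; nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader >> mem_log.txt; sleep 5; done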