Segformer Batch Size vs Memory Consumption vs Execution Time

Please provide the following information when requesting support.

• Hardware: RTX 4090
• Network Type: Segformer (FAN backbone)
• TLT Version: see the tao info --verbose output below

tao info --verbose

```
Configuration of the TAO Toolkit Instance

task_group:
    model:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-tf2.11.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.2.0-pyt2.1.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                5.2.0.1-pyt1.14.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_pyt
                        2. segformer
    dataset:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-data-services:
                    docker_registry: nvcr.io
                    tasks:
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-deploy:
                    docker_registry: nvcr.io
                    tasks:
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0.1
published_date: 01/16/2024
```

My training set is around 4K color images at 512×512.

If I use a batch size of 8, it uses all of the GPU memory, and the ETA reported after 1,000 iterations is 2 days.

If I use a batch size of 1, it uses about 3 GB and the ETA is 5 hours…?

This is counterintuitive. Any ideas why?

Many thanks!!

David

This is not an apples-to-apples comparison. It is important to compare the same number of samples.

For example, you can compare something like:
• batch size 8 per GPU, 1 GPU, time for 4 iterations
• batch size 1 per GPU, 1 GPU, time for 32 iterations

Both cases cover 32 samples, so the comparison is fair.
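To make that concrete, here is a minimal timing sketch. train_step is a hypothetical stand-in for one training iteration (it is not a TAO API), and the per-image cost and per-step overhead are invented numbers purely for illustration:

```python
import time

def train_step(batch_size, per_image_cost=0.09, overhead=0.01):
    # Hypothetical stand-in for one training iteration: a fixed per-step
    # overhead plus a cost proportional to the batch size once the GPU
    # is saturated. The numbers are made up for illustration.
    time.sleep(overhead + per_image_cost * batch_size)

def time_n_iters(batch_size, n_iters):
    # Wall-clock time to run n_iters iterations at the given batch size.
    start = time.perf_counter()
    for _ in range(n_iters):
        train_step(batch_size)
    return time.perf_counter() - start

# Fair comparison: both runs cover 8 * 4 == 1 * 32 == 32 samples.
print(f"bs=8,  4 iters: {time_n_iters(8, 4):.2f} s")   # ~2.9 s
print(f"bs=1, 32 iters: {time_n_iters(1, 32):.2f} s")  # ~3.2 s
```

With this toy cost model the two runs take roughly the same wall-clock time for the same 32 samples, with the larger batch slightly ahead because the fixed per-step overhead is amortized over more images.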

Another example:
• max_iters: 1, bs_per_gpu: 2, gpus: 1 ==> 2 images per iteration; for each iteration, the dataloader randomly selects 2 images.
• max_iters: 1, bs_per_gpu: 2, gpus: 2 ==> 4 images per iteration; for each iteration, the dataloader randomly selects 4 images.

In other words, the number of images per iteration depends only on bs_per_gpu and the number of GPUs; it is not related to max_iters.
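A minimal sketch of that arithmetic (the function names are illustrative; only the config keys bs_per_gpu, gpus, and max_iters come from the examples above):

```python
def images_per_iter(bs_per_gpu, gpus):
    # Each iteration draws bs_per_gpu images on every GPU.
    return bs_per_gpu * gpus

def total_images(bs_per_gpu, gpus, max_iters):
    # Total samples processed over the whole run.
    return images_per_iter(bs_per_gpu, gpus) * max_iters

print(images_per_iter(bs_per_gpu=2, gpus=1))            # 2 images per iter
print(images_per_iter(bs_per_gpu=2, gpus=2))            # 4 images per iter
print(total_images(bs_per_gpu=8, gpus=1, max_iters=4))  # 32 samples total
```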

@Morganh Thanks!

If max_iters: 200000 in both cases, why would Segformer train faster with a batch size of 1 than with a batch size of 8? Much faster…

It’s not intuitive to me. My expectation is that with a batch of 8, most of the per-image processing happens in parallel on the GPU, so the time per image with a batch of 8 should be approximately T(batch of 1)/8…

As mentioned above, it is because with the same max_iters, the bs8 run processes 8× more total samples than the bs1 run. Once the GPU is saturated, the time per iteration grows roughly in proportion to the batch size, so at a fixed max_iters the larger batch does more work and takes correspondingly longer.
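A back-of-the-envelope check with the numbers quoted in this thread is consistent with that, assuming both ETAs are for the full max_iters: 200000 run:

```python
MAX_ITERS = 200_000  # assuming max_iters is 200000 for both runs, as above

def images_per_second(batch_size, eta_hours):
    # Overall per-image throughput implied by the quoted ETA.
    total_images = batch_size * MAX_ITERS
    return total_images / (eta_hours * 3600)

print(f"bs=1: {images_per_second(1, eta_hours=5):.1f} images/s")   # ~11.1
print(f"bs=8: {images_per_second(8, eta_hours=48):.1f} images/s")  # ~9.3
```

Per-image throughput is in the same ballpark for both runs, which suggests the GPU is already well utilized at batch size 1 for 512×512 inputs; the roughly 10× wall-clock gap comes almost entirely from the 8× difference in total images processed.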

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
