How to train a model with multiple GPUs

• Hardware (T4/V100/Xavier/Nano/etc): GTX 1080 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Detectnet_v2 and Classification
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): tlt:3.0 and docker_tag: v3.0-dp-py3

Hi,
I want to train a model on multiple GPUs in one system. TLT provides --gpus and --gpu_index options, so I set --gpus=2 and --gpu_index=0;1, but I get an error and training does not continue. I also tried --gpu_index=0,1 and --gpu_index=[0,1] and got the same error.

Please set
--gpu_index 0 1
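
For example, a full two-GPU training command could look like the line below (the spec file path, results directory, and key are just placeholders here, adjust them for your setup):

tlt detectnet_v2 train --gpus 2 --gpu_index 0 1 -e /workspace/tlt-experiments/specs/detectnet_v2_train.txt -r /workspace/tlt-experiments/experiment_dir_unpruned -k <your_key>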

@Morganh,
It doesn’t work.

Can you share the command line and full log?

Related topic: AssertionError: The number of GPUs ([1]) must be the same as the number of GPU indices (4) provided - #14 by Morganh
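
In other words, that AssertionError fires when the value of --gpus does not match the number of indices passed to --gpu_index. For illustration (remaining arguments omitted):

tlt detectnet_v2 train --gpus 2 --gpu_index 0 1 ...   (valid: 2 GPUs, 2 indices)
tlt detectnet_v2 train --gpus 2 --gpu_index 0 ...     (fails the assertion: 2 GPUs, 1 index)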

tlt lprnet inference --gpus 2 --gpu_index 0 1 -i /workspace/tlt-experiments/ocr/data/train/image -e /workspace/tlt-experiments/ocr/specs/tutorial_spec.txt -m /workspace/tlt-experiments/ocr/experiment_dir_unpruned/weights/lprnet_epoch-60.tlt -k nvidia_tlt

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Do you have the full log?
