AssertionError: The number of GPUs ([1]) must be the same as the number of GPU indices (4) provided

@Morganh

!tlt ssd train --gpus 4 --gpu_index=$GPU_INDEX \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5

It throws the error log mentioned below when I use --gpus 4. I have 4 GPUs (4 V100s).
When I used 1 GPU, training started, but not with 4.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 315, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 224, in set_gpu_info_single_node
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 201, in check_valid_gpus

**AssertionError: The number of GPUs ([1]) must be the same as the number of GPU indices (4) provided.**
2021-06-18 10:54:49,490 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
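
For context on the error itself: the traceback ends in check_valid_gpus, and the message suggests the launcher simply compares the list of indices given via --gpu_index against the count requested with --gpus; here --gpus was 4 but $GPU_INDEX supplied only the single index 1. A rough sketch of that kind of check (a reconstruction for illustration; the names and logic are assumptions, not the actual entrypoint code):

# Hypothetical reconstruction of the failing check, not the real TLT source.
def check_valid_gpus(num_gpus, gpu_ids):
    # The launcher appears to require one explicit index per requested GPU.
    assert len(gpu_ids) == num_gpus, (
        "The number of GPUs ({}) must be the same as the number of GPU "
        "indices ({}) provided".format(gpu_ids, num_gpus)
    )

check_valid_gpus(4, [1])  # raises the AssertionError shown in the log above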

The ssd_train_resnet18_kitti.txt looks like this:

.
.
.

training_config {
  batch_size_per_gpu: 16
  num_epochs: 2
  enable_qat: false
  learning_rate {
  soft_start_annealing_schedule {
.
.
.
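
(Side note on the spec above: batch_size_per_gpu is a per-GPU value, so with 4 GPUs the effective global batch size would typically be 16 × 4 = 64, assuming the usual data-parallel multi-GPU behaviour.)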

!nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   35C    P0    35W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   36C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When you use 1 GPU, please set a new results folder. For example,
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new

@Morganh
It’s still not working…

Do we need to do anything with $GPU_INDEX??

What is $GPU_INDEX when you run with 1 GPU?

It was 1.

I tried changing it; still the same error.

Can you set $GPU_INDEX to 0 when you run with 1 GPU?

Yeah, I tried with GPU_INDEX = 0 for 1 GPU and it started the training,
but for 2 GPUs it throws:

AssertionError: The number of GPUs ([0]) must be the same as the number of GPU indices (2) provided.

When you run with 2 GPUs, what are “--gpus” and “$GPU_INDEX”?

--gpus 2
$GPU_INDEX = 0

Please try below.
--gpus 2
$GPU_INDEX = 0,1

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2021-06-18 13:27:28,176 [INFO] root: Registry: ['nvcr.io']
2021-06-18 13:27:28,238 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
usage: ssd train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -r RESULTS_DIR
                 -k KEY [-m RESUME_MODEL_WEIGHTS]
                 [--initial_epoch INITIAL_EPOCH] [--arch {None,ssd,dssd}]
                 [--use_multiprocessing]
                 {evaluate,export,inference,prune,train} ...
ssd train: error: argument --gpu_index: invalid int value: '0,1'
2021-06-18 13:27:33,202 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Sorry, could you try below again?
$GPU_INDEX = 0 1

No @Morganh
It’s not working at all…
This problem is a blocker for me right now.

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2021-06-18 13:32:29,126 [INFO] root: Registry: ['nvcr.io']
2021-06-18 13:32:29,187 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
usage: ssd train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -r RESULTS_DIR
                 -k KEY [-m RESUME_MODEL_WEIGHTS]
                 [--initial_epoch INITIAL_EPOCH] [--arch {None,ssd,dssd}]
                 [--use_multiprocessing]
                 {evaluate,export,inference,prune,train} ...
ssd train: error: invalid choice: '1' (choose from 'evaluate', 'export', 'inference', 'prune', 'train')
2021-06-18 13:32:34,146 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Could you try as below?
! tlt ssd train --gpus 2 --gpu_index 0 1 -e xxx
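
For reference, the usage output above declares --gpu_index as a repeated integer argument (--gpu_index GPU_INDEX [GPU_INDEX ...]), which is why the comma-joined "0,1" is rejected as an invalid int and the indices have to be listed space-separated directly after the flag. The earlier "0 1" attempt most likely failed because the original command passed the value as --gpu_index=$GPU_INDEX, so after expansion the trailing "1" was read as a positional subcommand ("invalid choice: '1'"). A minimal argparse sketch reproducing this behaviour (an illustration only, not the actual TLT parser):

import argparse

# Hypothetical stand-in for the `ssd train` argument parser.
parser = argparse.ArgumentParser(prog="ssd train")
parser.add_argument("--gpus", type=int, default=1)
parser.add_argument("--gpu_index", type=int, nargs="+", default=[0])

# Works: space-separated indices right after the flag.
args = parser.parse_args(["--gpus", "2", "--gpu_index", "0", "1"])
print(args.gpus, args.gpu_index)  # 2 [0, 1]

# Fails: the comma-joined string cannot be converted to a single int.
# parser.parse_args(["--gpus", "2", "--gpu_index", "0,1"])
# -> ssd train: error: argument --gpu_index: invalid int value: '0,1'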


Thanks! This way it worked!!

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
! tlt ssd train --gpus 4 --gpu_index 0 1 2 3 \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_4GPUs \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5

Sorry for the inconvenience. I will sync internally to improve the TLT user guide.


Yeah, please.
There were multiple points of confusion in the beginning, like:

  1. How is the container setup going to work?
  2. This multi-GPU training stuff,
     etc.

Thanks!!

This is mentioned in TLT Launcher — Transfer Learning Toolkit 3.0 documentation


For multiGPU training, see Optimizing the Training Pipeline — Transfer Learning Toolkit 3.0 documentation

Currently, only single-node multi-GPU training is supported.

What is the significance of this?