AssertionError: The number of GPUs ([1]) must be the same as the number of GPU indices (4) provided

@Morganh

!tlt ssd train --gpus 4 --gpu_index=$GPU_INDEX \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5

This throws the error log below.
I am using --gpus 4, and I have 4 GPUs (4 V100s).
When I use --gpus 1, training starts, but not with 4.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/ssd", line 8, in <module>
    sys.exit(main())
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/entrypoint/ssd.py", line 12, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 315, in launch_job
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 224, in set_gpu_info_single_node
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 201, in check_valid_gpus

**AssertionError: The number of GPUs ([1]) must be the same as the number of GPU indices (4) provided.**
2021-06-18 10:54:49,490 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The ssd_train_resnet18_kitti.txt looks like this:

.
.
.

training_config {
  batch_size_per_gpu: 16
  num_epochs: 2
  enable_qat: false
  learning_rate {
  soft_start_annealing_schedule {
.
.
.

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   35C    P0    37W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   35C    P0    35W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   36C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When you use 1 GPU, please set a new result folder. For example,
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new
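
For instance, keeping the rest of the original command unchanged (a minimal sketch assuming the same notebook variables as above):

! tlt ssd train --gpus 1 --gpu_index=$GPU_INDEX \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5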

@Morganh
Still, it's not working…

Do we need to do anything with $GPU_INDEX?

What is the $GPU_INDEX when you run 1 GPU?

It was 1.

I tried by changing it, still the same.

Can you set $GPU_INDEX to 0 when you run 1 GPU?

Yes, I tried with GPU_INDEX = 0 for 1 GPU and the training started,
but for 2 GPUs it throws:

AssertionError: The number of GPUs ([0]) must be the same as the number of GPU indices (2) provided.

When you run with 2 GPUs, what are "--gpus" and "$GPU_INDEX"?

--gpus 2
$GPU_INDEX = 0

Please try below.
--gpus 2
$GPU_INDEX = 0,1

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2021-06-18 13:27:28,176 [INFO] root: Registry: ['nvcr.io']
2021-06-18 13:27:28,238 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
usage: ssd train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -r RESULTS_DIR
                 -k KEY [-m RESUME_MODEL_WEIGHTS]
                 [--initial_epoch INITIAL_EPOCH] [--arch {None,ssd,dssd}]
                 [--use_multiprocessing]
                 {evaluate,export,inference,prune,train} ...
ssd train: error: argument --gpu_index: invalid int value: '0,1'
2021-06-18 13:27:33,202 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Sorry, could you try below again?
$GPU_INDEX = 0 1

No, @Morganh.
It's not working at all…
This problem is a blocker for me right now.

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2021-06-18 13:32:29,126 [INFO] root: Registry: ['nvcr.io']
2021-06-18 13:32:29,187 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
usage: ssd train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -r RESULTS_DIR
                 -k KEY [-m RESUME_MODEL_WEIGHTS]
                 [--initial_epoch INITIAL_EPOCH] [--arch {None,ssd,dssd}]
                 [--use_multiprocessing]
                 {evaluate,export,inference,prune,train} ...
ssd train: error: invalid choice: '1' (choose from 'evaluate', 'export', 'inference', 'prune', 'train')
2021-06-18 13:32:34,146 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Could you try as below?
! tlt ssd train --gpus 2 --gpu_index 0 1 -e xxx
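
A likely explanation (my reading of the usage output above, not something stated in the TLT docs): --gpu_index is declared as GPU_INDEX [GPU_INDEX ...], i.e. one or more space-separated integers. So '0,1' fails because the comma-separated string cannot be parsed as an int, and the --gpu_index=$GPU_INDEX form binds only the first value to the flag, leaving the second value ('1') to fall through as a stray positional argument, which produces the "invalid choice: '1'" error. A sketch of the contrast, assuming a bash-style shell and the same notebook variables:

# GPU_INDEX is assumed here to hold the space-separated list "0 1"
# Fails: '--gpu_index=$GPU_INDEX' expands to '--gpu_index=0' plus a stray '1'
! tlt ssd train --gpus 2 --gpu_index=$GPU_INDEX -e $SPECS_DIR/ssd_train_resnet18_kitti.txt ...
# Works: both indices are passed as separate arguments after the flag
! tlt ssd train --gpus 2 --gpu_index 0 1 -e $SPECS_DIR/ssd_train_resnet18_kitti.txt ...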

Thanks! It worked this way!

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
! tlt ssd train --gpus 4 --gpu_index 0 1 2 3 \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_4GPUs \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5
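
Note: if you would rather keep the $GPU_INDEX variable instead of hard-coding the indices, one possible form (a sketch, assuming the variable is set with %env as in the TLT notebooks; not verified against the guide) is to store a space-separated list and pass it after the flag without '=', so the shell splits it into separate arguments:

%env GPU_INDEX=0 1 2 3
! tlt ssd train --gpus 4 --gpu_index $GPU_INDEX \
               -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_4GPUs \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5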

Sorry for the inconvenience. I will sync internally to improve the TLT user guide.

Yes, please.
There were multiple points of confusion in the beginning, such as:

  1. How the container setup works
  2. Multi-GPU training
     etc.

Thanks!!

This is mentioned in https://docs.nvidia.com/tlt/tlt-user-guide/text/tlt_launcher.html

For multi-GPU training, see https://docs.nvidia.com/tlt/tlt-user-guide/text/qat_and_amp_for_training.html?highlight=multi#optimizing-the-training-pipeline

Currently, only single-node multi-GPU training is supported.

What is the significance of this?