Loss function in classifier training causes a ValueError

I am training the classification model with ResNet-18 and I get the following error at the first epoch:

Epoch 1/80
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 27, in main
  File "./makenet/scripts/train.py", line 410, in main
  File "./makenet/scripts/train.py", line 385, in run_experiment
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 809, in _standardize_user_data
    y, self._feed_loss_fns, feed_output_shapes)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_utils.py", line 273, in check_loss_and_target_compatibility
    ' while using as loss `categorical_crossentropy`. '
ValueError: You are passing a target array of shape (205, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical
y_binary = to_categorical(y_int)


Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
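
For reference, outside of TLT this is the standard Keras target-shape mismatch; a minimal standalone sketch of the two options the error message suggests (made-up labels, not TLT code):

# minimal sketch: integer labels vs. one-hot targets
import numpy as np
from keras.utils import to_categorical

y_int = np.array([0, 1, 1, 0])      # integer class labels, shape (4,)
y_onehot = to_categorical(y_int)    # one-hot matrix, shape (4, 2)
# 'categorical_crossentropy' expects the (samples, classes) one-hot form;
# 'sparse_categorical_crossentropy' accepts the integer labels directly.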

My images all belong to a single class and are 1024x768 in resolution. Below is my specification file:

model_config {

  # Model architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet']

  arch: "resnet"

  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]

  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1

  # image size should be "3, X, Y", where X,Y >= 16
  input_image_size: "3,1024,768"
}

eval_config {
  eval_dataset_path: "/workspace/experiments/dataset/test"
  model_path: "/workspace/experiments/output/weights/classifier_delivery_epoch_200.tlt"
  top_k: 3
  #conf_threshold: 0.5
  batch_size: 256
  n_workers: 8

}

train_config {
  train_dataset_path: "/workspace/experiments/dataset/train"
  val_dataset_path: "/workspace/experiments/dataset/valid"

  # optimizer can be chosen from ['adam', 'sgd']

  optimizer: "sgd"
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16

  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005

  }

  # learning_rate

  lr_config {

    # "step" and "soft_anneal" are supported.

    scheduler: "soft_anneal"

    # "soft_anneal" stands for soft annealing learning rate scheduler.
    # the following 4 parameters should be specified if "soft_anneal" is used.
    learning_rate: 0.005
    soft_start: 0.056
    annealing_points: "0.3, 0.6, 0.8"
    annealing_divider: 10
    # "step" stands for step learning rate scheduler.
    # the following 3 parameters should be specified if "step" is used.
    # learning_rate: 0.006
    # step_size: 10
    # gamma: 0.1
  }
}

There doesn’t appear to be any way to affect this via the specification file, and without access to the code I’m stuck without NVIDIA’s help. Please advise; thanks in advance.

Hi monocongo
Refer to https://github.com/OlafenwaMoses/ImageAI/issues/40

I can reproduce your issue when there is only one class, as you mentioned.
Please create at least two classes for training.

With two or more classes, I can run the training successfully.
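
The classes are inferred from the subdirectories of train_dataset_path (one folder per class), so please make sure there are at least two class folders containing images. If helpful, a quick standalone Keras check (outside of TLT, using the path from your spec) to confirm how many classes are detected:

# standalone sanity check, not TLT code: count the classes Keras sees
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator().flow_from_directory(
    "/workspace/experiments/dataset/train")   # same as train_dataset_path
print(gen.num_classes, gen.class_indices)     # expect at least 2 classes here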

Thanks for the help. I have added some random images to the dataset as a second class, and now I get the following error when I run the training:

Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
--------------------------------------------------------------------------
[[56220,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 8a4c5074e3dc

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2019-12-17 13:32:59.536160: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536328: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536662: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536865: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
  File "./makenet/scripts/train.py", line 41, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
  File "./makenet/scripts/train.py", line 41, in <module>
  File "./makenet/scripts/train.py", line 41, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '3' but visible device count is 1
2019-12-17 13:32:59.677676: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x63a9660 executing computations on platform CUDA. Devices:
2019-12-17 13:32:59.677709: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-12-17 13:32:59.680042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3300000000 Hz
2019-12-17 13:32:59.681322: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6518550 executing computations on platform Host. Devices:
2019-12-17 13:32:59.681390: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-17 13:32:59.681643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:17:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2019-12-17 13:32:59.681684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-17 13:32:59.683818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-17 13:32:59.683849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-12-17 13:32:59.683867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-12-17 13:32:59.683994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-12-17 13:32:59,688 [INFO] iva.makenet.scripts.train: Loading experiment spec at specs/classification_resnet_train.txt.
2019-12-17 13:32:59,690 [INFO] iva.makenet.spec_handling.spec_loader: Merging specification from specs/classification_resnet_train.txt
Found 211 images belonging to 2 classes.
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
2019-12-17 13:32:59,826 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/experiments/dataset/train
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[56220,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[8a4c5074e3dc:02826] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[8a4c5074e3dc:02826] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Can you make any sense of this?

It turns out that the above error is caused by using “--gpus 4” in the training command. My machine has 4 GPUs, so it is not clear why the TLT code can only see one of them. If I use a single GPU (i.e. “--gpus 1”) the issue goes away, but then I’m missing out on 75% of my machine’s capacity. Can you advise?

The inability to see more than a single GPU turned out to be user error: I had started my Docker container with only a single GPU exposed:

$ sudo docker run --gpus device=0 -it -v ${TLT}:/workspace/experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2 /bin/bash
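
Relaunching the container with all GPUs visible (Docker’s “--gpus all”, assuming all four should be exposed) should allow “--gpus 4” to work; something along these lines:

$ sudo docker run --gpus all -it -v ${TLT}:/workspace/experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2 /bin/bash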