Loss function in classifier training causes a ValueError

I am training the classification model with ResNet-18 and I get the following error at the first epoch:

Epoch 1/80
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 27, in main
  File "./makenet/scripts/train.py", line 410, in main
  File "./makenet/scripts/train.py", line 385, in run_experiment
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 809, in _standardize_user_data
    y, self._feed_loss_fns, feed_output_shapes)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_utils.py", line 273, in check_loss_and_target_compatibility
    ' while using as loss `categorical_crossentropy`. '
ValueError: You are passing a target array of shape (205, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical
y_binary = to_categorical(y_int)


Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
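
For reference, outside of TLT this is the standard Keras target-shape mismatch; a minimal standalone sketch of the two options the error message suggests (made-up labels, not TLT code):

# minimal sketch: integer labels vs. one-hot targets
import numpy as np
from keras.utils import to_categorical

y_int = np.array([0, 1, 1, 0])      # integer class labels, shape (4,)
y_onehot = to_categorical(y_int)    # one-hot matrix, shape (4, 2)
# 'categorical_crossentropy' expects the (samples, classes) one-hot form;
# 'sparse_categorical_crossentropy' accepts the integer labels directly.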

My images all belong to a single class and are 1024x768 in resolution. Below is my specification file:

model_config {

  # Model architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet']

  arch: "resnet"

  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]

  n_layers: 18
  use_bias: True
  use_batch_norm: True
  all_projections: True
  use_pooling: False
  freeze_bn: False
  freeze_blocks: 0
  freeze_blocks: 1

  # image size should be "3, X, Y", where X,Y >= 16
  input_image_size: "3,1024,768"
}

eval_config {
  eval_dataset_path: "/workspace/experiments/dataset/test"
  model_path: "/workspace/experiments/output/weights/classifier_delivery_epoch_200.tlt"
  top_k: 3
  #conf_threshold: 0.5
  batch_size: 256
  n_workers: 8

}

train_config {
  train_dataset_path: "/workspace/experiments/dataset/train"
  val_dataset_path: "/workspace/experiments/dataset/valid"

  # optimizer can be chosen from ['adam', 'sgd']

  optimizer: "sgd"
  batch_size_per_gpu: 256
  n_epochs: 80
  n_workers: 16

  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005

  }

  # learning_rate

  lr_config {

    # "step" and "soft_anneal" are supported.

    scheduler: "soft_anneal"

    # "soft_anneal" stands for soft annealing learning rate scheduler.
    # the following 4 parameters should be specified if "soft_anneal" is used.
    learning_rate: 0.005
    soft_start: 0.056
    annealing_points: "0.3, 0.6, 0.8"
    annealing_divider: 10
    # "step" stands for step learning rate scheduler.
    # the following 3 parameters should be specified if "step" is used.
    # learning_rate: 0.006
    # step_size: 10
    # gamma: 0.1
  }
}

There doesn’t appear to be any way to affect this via the specification file, and without access to the code I’m stuck without NVIDIA’s help. Please advise; thanks in advance.

Hi monocongo
Refer to https://github.com/OlafenwaMoses/ImageAI/issues/40

I can reproduce your issue when there is only one class, as you mentioned.
Please create at least two classes for training.

With two or more classes, I can run the training successfully.
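
The classes are inferred from the subdirectories of train_dataset_path (one folder per class), so please make sure there are at least two class folders containing images. If helpful, a quick standalone Keras check (outside of TLT, using the path from your spec) to confirm how many classes are detected:

# standalone sanity check, not TLT code: count the classes Keras sees
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator().flow_from_directory(
    "/workspace/experiments/dataset/train")   # same as train_dataset_path
print(gen.num_classes, gen.class_indices)     # expect at least 2 classes here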

Thanks for the help. I have added some random images to the dataset as a second class, and now I get the following error when I run the training:

Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
--------------------------------------------------------------------------
[[56220,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 8a4c5074e3dc

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2019-12-17 13:32:59.536160: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536328: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536662: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-12-17 13:32:59.536865: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
  File "./makenet/scripts/train.py", line 41, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 26, in main
  File "./makenet/scripts/train.py", line 41, in <module>
  File "./makenet/scripts/train.py", line 41, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '2' but visible device count is 1
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '3' but visible device count is 1
2019-12-17 13:32:59.677676: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x63a9660 executing computations on platform CUDA. Devices:
2019-12-17 13:32:59.677709: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-12-17 13:32:59.680042: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3300000000 Hz
2019-12-17 13:32:59.681322: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6518550 executing computations on platform Host. Devices:
2019-12-17 13:32:59.681390: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-17 13:32:59.681643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:17:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2019-12-17 13:32:59.681684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-17 13:32:59.683818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-17 13:32:59.683849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-12-17 13:32:59.683867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-12-17 13:32:59.683994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5)
2019-12-17 13:32:59,688 [INFO] iva.makenet.scripts.train: Loading experiment spec at specs/classification_resnet_train.txt.
2019-12-17 13:32:59,690 [INFO] iva.makenet.spec_handling.spec_loader: Merging specification from specs/classification_resnet_train.txt
Found 211 images belonging to 2 classes.
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
2019-12-17 13:32:59,826 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/experiments/dataset/train
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[56220,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[8a4c5074e3dc:02826] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[8a4c5074e3dc:02826] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Can you make any sense of this?

It turns out that the above error is caused by using “--gpus 4” in the training command. My machine has 4 GPUs, so it is not clear why the TLT code can only see one of them. If I use a single GPU (i.e. “--gpus 1”) the issue goes away, but then I’m missing out on 75% of my machine’s capacity. Can you advise?

The inability to see more than a single GPU turned out to be user error: I had started my Docker container with only a single GPU exposed:

$ sudo docker run --gpus device=0 -it -v ${TLT}:/workspace/experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2 /bin/bash
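
Relaunching the container with all GPUs visible (Docker’s “--gpus all”, assuming all four should be exposed) should allow “--gpus 4” to work; something along these lines:

$ sudo docker run --gpus all -it -v ${TLT}:/workspace/experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2 /bin/bash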