TLT with AlexNet -> PMIX ERROR

Hi,

I am currently trying to train AlexNet with TLT (installed from the NGC container, v1.0.1_py2).

I copied the Jupyter classification example notebook (which I tried, and it works perfectly) and adapted it to my own use case.
Everything works well until I try to run tlt-train.
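For reference, the command I run is roughly the following (the spec path matches the log below; the output directory and key are placeholders for my actual values):

tlt-train classification --gpus 1 \
    -e /workspace/tlt-experiments/data/DEL/specs/classification_spec.cfg \
    -r /workspace/tlt-experiments/output \
    -k $API_KEY

This is the output I get: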

Using TensorFlow backend.
--------------------------------------------------------------------------
[[59990,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 721af5e6106b

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2020-03-26 15:56:02.875967: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-26 15:56:02.993264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 15:56:02.994166: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5aab100 executing computations on platform CUDA. Devices:
2020-03-26 15:56:02.994201: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-03-26 15:56:02.996237: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-03-26 15:56:02.996501: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5bc57d0 executing computations on platform Host. Devices:
2020-03-26 15:56:02.996533: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-03-26 15:56:02.996683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-03-26 15:56:02.996706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-26 15:56:02.997676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-26 15:56:02.997694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-26 15:56:02.997705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-26 15:56:02.997775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14249 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-03-26 15:56:02,999 [INFO] iva.makenet.scripts.train: Loading experiment spec at /workspace/tlt-experiments/data/DEL/specs/classification_spec.cfg.
2020-03-26 15:56:03,000 [INFO] iva.makenet.spec_handling.spec_loader: Merging specification from /workspace/tlt-experiments/data/DEL/specs/classification_spec.cfg
Found 1600 images belonging to 2 classes.
2020-03-26 15:56:03,105 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/tlt-experiments/data/DEL/split/train
Found 229 images belonging to 2 classes.
2020-03-26 15:56:03,207 [INFO] iva.makenet.scripts.train: Processing dataset (validation): /workspace/tlt-experiments/data/DEL/split/val
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-26 15:56:03,213 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 3, 640, 480)       0         
_________________________________________________________________
conv1 (Conv2D)               (None, 96, 160, 120)      34944     
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 96, 80, 60)        0         
_________________________________________________________________
conv2 (Conv2D)               (None, 256, 80, 60)       614656    
_________________________________________________________________
pool2 (MaxPooling2D)         (None, 256, 40, 30)       0         
_________________________________________________________________
conv3 (Conv2D)               (None, 384, 40, 30)       885120    
_________________________________________________________________
conv4 (Conv2D)               (None, 384, 40, 30)       1327488   
_________________________________________________________________
conv5 (Conv2D)               (None, 256, 40, 30)       884992    
_________________________________________________________________
avg_pool (AveragePooling2D)  (None, 256, 1, 1)         0         
_________________________________________________________________
flatten (Flatten)            (None, 256)               0         
_________________________________________________________________
predictions (Dense)          (None, 2)                 514       
=================================================================
Total params: 3,747,714
Trainable params: 3,747,714
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2020-03-26 15:56:04,280 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/30
[721af5e6106b:00173] PMIX ERROR: UNPACK-PAST-END in file client/pmix_client.c at line 115

What can I do to avoid this error?

In case it is useful, here is the spec file I use:

model_config {

  # Model architecture can be chosen from:
  # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet']

  arch: "alexnet"

  # for resnet --> n_layers can be [10, 18, 50]
  # for vgg --> n_layers can be [16, 19]

  #n_layers: 18
  #use_bias: True
  #use_batch_norm: True
  #all_projections: True
  #use_pooling: False
  #freeze_bn: False
  #freeze_blocks: 0
  #freeze_blocks: 1

  # image size should be "3, X, Y", where X,Y >= 16
  input_image_size: "3,640,480"
}

eval_config {
  eval_dataset_path: "/workspace/tlt-experiments/data/DEL/split/test"
  model_path: "/workspace/tlt-experiments/output/weights/alexnet_030.tlt"
  top_k: 1
  #conf_threshold: 0.5
  batch_size: 256
  n_workers: 8

}

train_config {
  train_dataset_path: "/workspace/tlt-experiments/data/DEL/split/train"
  val_dataset_path: "/workspace/tlt-experiments/data/DEL/split/val"
  pretrained_model_path: "/workspace/tlt-experiments/pretrained_alexnet/tlt_alexnet_classification_v1/alexnet.hdf5"
  # optimizer can be chosen from ['adam', 'sgd']

  optimizer: "sgd"
  batch_size_per_gpu: 256
  n_epochs: 30
  n_workers: 16

  # regularizer
  #reg_config {
  #  type: "L2"
  #  scope: "Conv2D,Dense"
  #  weight_decay: 0.00005

  #}

  # learning_rate

  lr_config {
    # "step" and "soft_anneal" are supported.
    scheduler: "step"

    # "soft_anneal" stands for the soft annealing learning rate scheduler.
    # The following 4 parameters should be specified if "soft_anneal" is used.
    #learning_rate: 0.005
    #soft_start: 0.056
    #annealing_points: "0.3, 0.6, 0.8"
    #annealing_divider: 10

    # "step" stands for the step learning rate scheduler.
    # The following 3 parameters should be specified if "step" is used.
    learning_rate: 0.0005
    step_size: 33
    gamma: 0.1
  }
}

Hi timothe.frignac,
If you are seeing the above error, please try lowering your training batch size. This PMIX error typically appears when the training process dies unexpectedly, most often from running out of memory, and batch_size_per_gpu: 256 with a 3,640,480 input is demanding for a single T4.
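For example, lower batch_size_per_gpu in the train_config section of your spec file (the value 64 here is just an illustration; keep everything else unchanged):

  batch_size_per_gpu: 64   # lowered from 256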

Hi,

Lowering the batch size to 64 seems to fix the issue. Thank you!