Hi,
I am trying to train AlexNet with TLT (installed from the NGC container v1.0.1_py2).
I copied the Jupyter classification example notebook (which runs without problems as shipped) and adapted it to my own use case.
Everything works until I run tlt-train, which prints the following and stops at the first epoch:
Using TensorFlow backend.
--------------------------------------------------------------------------
[[59990,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: 721af5e6106b
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2020-03-26 15:56:02.875967: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-26 15:56:02.993264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-26 15:56:02.994166: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5aab100 executing computations on platform CUDA. Devices:
2020-03-26 15:56:02.994201: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-03-26 15:56:02.996237: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-03-26 15:56:02.996501: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5bc57d0 executing computations on platform Host. Devices:
2020-03-26 15:56:02.996533: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-03-26 15:56:02.996683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-03-26 15:56:02.996706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-26 15:56:02.997676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-26 15:56:02.997694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-03-26 15:56:02.997705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-03-26 15:56:02.997775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14249 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-03-26 15:56:02,999 [INFO] iva.makenet.scripts.train: Loading experiment spec at /workspace/tlt-experiments/data/DEL/specs/classification_spec.cfg.
2020-03-26 15:56:03,000 [INFO] iva.makenet.spec_handling.spec_loader: Merging specification from /workspace/tlt-experiments/data/DEL/specs/classification_spec.cfg
Found 1600 images belonging to 2 classes.
2020-03-26 15:56:03,105 [INFO] iva.makenet.scripts.train: Processing dataset (train): /workspace/tlt-experiments/data/DEL/split/train
Found 229 images belonging to 2 classes.
2020-03-26 15:56:03,207 [INFO] iva.makenet.scripts.train: Processing dataset (validation): /workspace/tlt-experiments/data/DEL/split/val
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-26 15:56:03,213 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 3, 640, 480) 0
_________________________________________________________________
conv1 (Conv2D) (None, 96, 160, 120) 34944
_________________________________________________________________
pool1 (MaxPooling2D) (None, 96, 80, 60) 0
_________________________________________________________________
conv2 (Conv2D) (None, 256, 80, 60) 614656
_________________________________________________________________
pool2 (MaxPooling2D) (None, 256, 40, 30) 0
_________________________________________________________________
conv3 (Conv2D) (None, 384, 40, 30) 885120
_________________________________________________________________
conv4 (Conv2D) (None, 384, 40, 30) 1327488
_________________________________________________________________
conv5 (Conv2D) (None, 256, 40, 30) 884992
_________________________________________________________________
avg_pool (AveragePooling2D) (None, 256, 1, 1) 0
_________________________________________________________________
flatten (Flatten) (None, 256) 0
_________________________________________________________________
predictions (Dense) (None, 2) 514
=================================================================
Total params: 3,747,714
Trainable params: 3,747,714
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2020-03-26 15:56:04,280 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/30
[721af5e6106b:00173] PMIX ERROR: UNPACK-PAST-END in file client/pmix_client.c at line 115
What can I do to avoid this error?
In case it is useful, here is the spec file I use:
model_config {
# Model architecture can be chosen from:
# ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet']
arch: "alexnet"
# for resnet --> n_layers can be [10, 18, 50]
# for vgg --> n_layers can be [16, 19]
#n_layers: 18
#use_bias: True
#use_batch_norm: True
#all_projections: True
#use_pooling: False
#freeze_bn: False
#freeze_blocks: 0
#freeze_blocks: 1
# image size should be "3, X, Y", where X,Y >= 16
input_image_size: "3,640,480"
}
eval_config {
eval_dataset_path: "/workspace/tlt-experiments/data/DEL/split/test"
model_path: "/workspace/tlt-experiments/output/weights/alexnet_030.tlt"
top_k: 1
#conf_threshold: 0.5
batch_size: 256
n_workers: 8
}
train_config {
train_dataset_path: "/workspace/tlt-experiments/data/DEL/split/train"
val_dataset_path: "/workspace/tlt-experiments/data/DEL/split/val"
pretrained_model_path: "/workspace/tlt-experiments/pretrained_alexnet/tlt_alexnet_classification_v1/alexnet.hdf5"
# optimizer can be chosen from ['adam', 'sgd']
optimizer: "sgd"
batch_size_per_gpu: 256
n_epochs: 30
n_workers: 16
# regularizer
#reg_config {
# type: "L2"
# scope: "Conv2D,Dense"
# weight_decay: 0.00005
#}
# learning_rate
lr_config {
# "step" and "soft_anneal" are supported.
scheduler: "step"
# "soft_anneal" stands for soft annealing learning rate scheduler.
# the following 4 parameters should be specified if "soft_anneal" is used.
#learning_rate: 0.005
#soft_start: 0.056
#annealing_points: "0.3, 0.6, 0.8"
#annealing_divider: 10
# "step" stands for step learning rate scheduler.
# the following 3 parameters should be specified if "step" is used.
learning_rate: 0.0005
step_size: 33
gamma: 0.1
}
}
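
In case it helps with diagnosis, here is a rough back-of-the-envelope sketch of how much memory the forward activations would take with this spec. I do not know whether memory is related to the PMIX error; this only puts a number on the batch_size_per_gpu and input_image_size choices. The output shapes are copied from the layer summary in the log above. Everything else is my own assumption: float32 activations, one stored output tensor per layer, and no accounting for gradients, cuDNN workspace, or optimizer state, so the real footprint is likely larger.

```python
# Rough estimate of forward-activation memory for the model printed in the log.
# Assumptions (mine, not taken from TLT): float32 values, each layer output is
# stored once, gradients and framework overhead are ignored.
from functools import reduce
from operator import mul

# Per-image output shapes (channels, height, width) from the layer summary.
LAYER_OUTPUT_SHAPES = [
    ("input_1", (3, 640, 480)),
    ("conv1", (96, 160, 120)),
    ("pool1", (96, 80, 60)),
    ("conv2", (256, 80, 60)),
    ("pool2", (256, 40, 30)),
    ("conv3", (384, 40, 30)),
    ("conv4", (384, 40, 30)),
    ("conv5", (256, 40, 30)),
    ("avg_pool", (256, 1, 1)),
    ("flatten", (256,)),
    ("predictions", (2,)),
]

BYTES_PER_VALUE = 4   # float32 (assumption)
BATCH_SIZE = 256      # batch_size_per_gpu from the spec file


def element_count(shape):
    """Number of values in a tensor of the given shape."""
    return reduce(mul, shape, 1)


def activation_bytes(layers, batch_size, bytes_per_value=BYTES_PER_VALUE):
    """Total bytes needed to hold every layer output for one batch."""
    per_image = sum(element_count(shape) for _, shape in layers)
    return per_image * batch_size * bytes_per_value


if __name__ == "__main__":
    total = activation_bytes(LAYER_OUTPUT_SHAPES, BATCH_SIZE)
    print("Forward activations per batch: %.2f GiB" % (total / 2.0 ** 30))
    # Compare against the 14.65 GiB reported as free on the Tesla T4.
    for candidate in (128, 64, 32):
        smaller = activation_bytes(LAYER_OUTPUT_SHAPES, candidate)
        print("batch_size %d: %.2f GiB" % (candidate, smaller / 2.0 ** 30))
```

Changing BATCH_SIZE or the input_1 shape shows how quickly the estimate moves when batch_size_per_gpu or input_image_size is reduced in the spec.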